Abstract

This paper studies hypothesis testing and parameter estimation in the context of the 
divide and conquer algorithm. In a unified likelihood based framework, we propose new 
test statistics and point estimators obtained by aggregating various statistics from k 
subsamples of size n/k, where n is the sample size. In both low dimensional and high 
dimensional settings, we address the important question of how to choose k as n grows 
large, providing a theoretical upper bound on k such that the information loss due to 
the divide and conquer algorithm is negligible. In other words, the resulting estimators 
have the same inferential efficiencies and estimation rates as a practically infeasible 
oracle with access to the full sample. Thorough numerical results are provided to back 
up the theory. 


1 Introduction 

In recent years, the field of statistics has developed apace in response to the opportunities and 
challenges spawned from the ‘data revolution’, which marked the dawn of an era characterized 
by the availability of enormous datasets. An extensive toolkit of methodology is now in place for 
addressing a wide range of high dimensional problems, whereby the number of unknown parameters, 
d, is much larger than the number of observations, n. However, many modern datasets are instead 
characterized by n and d both large. The latter presents intimidating practical challenges resulting 
from storage and computational limitations, as well as numerous statistical challenges (Fan et al.,
2014). It is important that statistical methodology targeting modern application areas does not
lose sight of the practical burdens associated with manipulating such large scale datasets. In this 
vein, incisive new algorithms have been developed for exploiting modern computing architectures 
and recent advances in distributed computing. These algorithms enjoy computational efficiency
and facilitate data handling and storage, but come with a statistical overhead if inappropriately
tuned.

With increased mindfulness of the algorithmic difficulties associated with large datasets, the 
statistical community has witnessed a surge in recent activity in the statistical analysis of various 
divide and conquer (DC) algorithms, which randomly partition the n observations into k subsamples
of size $n_k = n/k$, construct statistics based on each subsample, and aggregate them in a
suitable way. In splitting the dataset, a single, very large scale estimation or inference problem
with computational complexity $O(\gamma(n))$, for a given function $\gamma(\cdot)$ that depends on the underlying
problem, is transformed into k high dimensional (large d, smaller $n_k$) problems, each with
computational complexity $O(\gamma(n/k))$ on each machine. What gets lost in this process is the interaction
between subsamples held on different machines, which cannot be recovered. However, the information lost
is statistically small, as the split subsamples are supposed to be independent. It is thus of
significant practical interest to derive a theoretical upper bound on the number of subsamples k 
that delivers the same statistical performance as the computationally infeasible “oracle” procedure 
based on the full sample. We develop communication efficient generalizations of the Wald and Rao 
score tests for the high dimensional scheme, as well as communication efficient estimators for the 
parameters of the high dimensional and low dimensional linear and generalized linear models. In 
all cases we give the upper bound on k for preserving the statistical error of the analogous full
sample procedure. 

While hypothesis testing in a low dimensional context is straightforward, in the high dimensional 
setting, nuisance parameters introduce a non-negligible bias, causing classical low dimensional theory
to break down. In our high dimensional Wald construction, the phenomenon is remedied
through a debiasing of the estimator, which gives rise to a test statistic with tractable limiting 
distribution, as documented in the k = 1 (no sample split) setting in Zhang and Zhang (2014) and 
van de Geer et al. (2014). For the high dimensional analogue of Rao’s score statistic, the incorporation
of a correction factor increases the convergence rate of higher order terms, thereby vanquishing
the effect of the nuisance parameters. The approach is introduced in the k = 1 setting in Ning
and Liu (2014), where the test statistic is shown to possess a tractable limit distribution. However,
the computational complexity of the debiased estimators increases by an order of magnitude, due
to solving d high-dimensional regularization problems. This motivates us to appeal to the divide
and conquer strategy.

We develop the theory and methodology for DC versions of these tests. In the case k = 1, each of 
the above test statistics can be decomposed into a dominant term with tractable limit distribution 
and a negligible remainder term. The DC extension requires delicate control of these remainder 
terms to ensure the error accumulation remains sufficiently small so as not to materially contaminate 
the leading term. Under this statistical guarantee, we obtain an upper bound on the number of
permitted subsamples, k. We find that the theoretical upper
bound on the number of subsamples guaranteeing the same inferential or estimation efficiency as
the procedure without DC is $k = o((s \log d)^{-1}\sqrt{n})$ in the linear model, where s is the sparsity of
the parameter vector. In the generalized linear model the scaling is $k = o(((s \vee s_1) \log d)^{-1}\sqrt{n})$,
where $s_1$ is the sparsity of the inverse information matrix.

For high dimensional estimation problems, we use the same debiasing trick introduced in the
high dimensional testing problems to obtain a thresholded divide and conquer estimator that
achieves the full sample minimax rate. The appropriate scaling is found to be $k = O(\sqrt{n/(s^2 \log d)})$
for the linear model and $k = O(\sqrt{n/((s \vee s_1)^2 \log d)})$ for the generalized linear model. Moreover,
we find that the loss incurred by the divide and conquer strategy, as quantified by the distance
between the DC estimator and the full sample estimator, is negligible in comparison to the statistical
error of the full sample estimator provided that k is not too large. In the context of estimation, the 
optimal scaling of k with n and d is also developed for the low dimensional linear and generalized 
linear model. This theory is of independent interest. It also allows us to study a refitted estimation 
procedure under a minimal signal strength assumption. 

A partial list of references covering DC algorithms from a statistical perspective is Chen and 
Xie (2012), Zhang et al. (2013), Kleiner et al. (2014), and Zhao et al. (2014a). For the high 
dimensional estimation setting, the same debiasing approach of van de Geer et al. (2014) is proposed 
independently by Lee et al. (2015) for divide-and-conquer estimation. Our paper differs from that 
of Lee et al. (2015) in that we additionally cover high dimensional hypothesis testing and refitted 
estimation in the DC setting. Our results on hypothesis testing reveal a different phenomenon to
that found in estimation, as we observe through the different requirements on the scaling of k. On
the estimation side, our results also differ from those of Lee et al. (2015) in that the additional 
refitting step allows us to achieve the oracle rate under the same scaling of k. 

The rest of the paper is organized as follows. Section 2 collects notation and details of a generic 
likelihood based framework. Section 3 covers testing, providing high dimensional DC analogues 
of the Wald test (Section 3.1) and Rao score test (Section 3.2), in each case deriving a tractable 
limit distribution for the corresponding test statistic under standard assumptions. Section 4 covers 
distributed estimation, proposing an aggregated estimator of $\beta^*$ for low dimensional and high
dimensional linear and generalized linear models, as well as a refitting procedure that improves 
the estimation rate, with the same scaling, under a minimal signal strength assumption. Section 5 
provides numerical experiments to back up the developed theory. In Section 6 we discuss our results 
together with remaining future challenges. Proofs of our main results are collected in Section 7, 
while the statement and proofs of a number of technical lemmas are deferred to the appendix. 

2 Background and Notation 

We first collect the general notation, before providing a formal statement of our statistical problems. 
More specialized notation is introduced in context. 

2.1 Generic Notation 

We adopt the common convention of using boldface letters for vectors only, while regular font is used
for both matrices and scalars, with the context ensuring no ambiguity. $|\cdot|$ denotes both absolute
value and cardinality of a set, with the context ensuring no ambiguity. For $v = (v_1, \ldots, v_d)^T \in \mathbb{R}^d$
and $1 \le q < \infty$, we define $\|v\|_q = (\sum_{j=1}^d |v_j|^q)^{1/q}$ and $\|v\|_0 = |\mathrm{supp}(v)|$, where $\mathrm{supp}(v) = \{j : v_j \neq 0\}$
and $|A|$ is the cardinality of the set A. Write $\|v\|_\infty = \max_{1 \le j \le d} |v_j|$, while for a matrix $M = [M_{jk}]$,
let $\|M\|_{\max} = \max_{j,k} |M_{jk}|$ and $\|M\|_1 = \sum_{j,k} |M_{jk}|$. For any matrix M we use $M_\ell$ to index the
transposed $\ell$-th row of M and $[M]_\ell$ to index the $\ell$-th column. The sub-Gaussian norm of a scalar
random variable X is defined as $\|X\|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} (\mathbb{E}|X|^q)^{1/q}$. For a random vector $X \in \mathbb{R}^d$,
its sub-Gaussian norm is defined as $\|X\|_{\psi_2} = \sup_{x \in S^{d-1}} \|\langle X, x \rangle\|_{\psi_2}$, where $S^{d-1}$ denotes the unit
sphere in $\mathbb{R}^d$. Let $I_d$ denote the $d \times d$ identity matrix; when the dimension is clear from the context,
we omit the subscript. We also denote the Hadamard product of two matrices A, B as $A \circ B$, with
$(A \circ B)_{jk} = A_{jk} B_{jk}$ for any j, k. $\{e_1, \ldots, e_d\}$ denotes the canonical basis for $\mathbb{R}^d$. For a vector $v \in \mathbb{R}^d$
and a set of indices $S \subseteq \{1, \ldots, d\}$, $v_S$ is the vector of length $|S|$ whose components are $\{v_j : j \in S\}$.
Additionally, for a vector v with $j$-th element $v_j$, we use the notation $v_{-j}$ to denote the remaining
vector when the $j$-th element is removed. With slight abuse of notation, we write $v = (v_j, v_{-j})$ when
we wish to emphasize the dependence of v on $v_j$ and $v_{-j}$ individually. The gradient of a function
$f(x)$ is denoted by $\nabla f(x)$, while $\nabla_x f((x, y))$ denotes the gradient of $f((x, y))$ with respect to x,
and $\nabla^2_{xy} f((x, y))$ denotes the matrix of cross partial derivatives with respect to the elements of
x and y. For a scalar $\eta$, we simply write $f'(\eta) := \nabla_\eta f(\eta)$ and $f''(\eta) := \nabla^2_{\eta\eta} f(\eta)$. For a random
variable X and a sequence of random variables, $X_n$, we write $X_n \rightsquigarrow X$ when $X_n$ converges weakly
to X. If X is a random variable with standard distribution, say $F_X$, we simply write $X_n \rightsquigarrow F_X$.
Given $a, b \in \mathbb{R}$, let $a \vee b$ and $a \wedge b$ denote the maximum and minimum of a and b. We also make
use of the notation $a_n \lesssim b_n$ ($a_n \gtrsim b_n$) if $a_n$ is less than (greater than) $b_n$ up to a constant, and
$a_n \asymp b_n$ if $a_n$ is the same order as $b_n$. Finally, for an arbitrary function f, we use $\operatorname{argzero}_\theta f(\theta)$ to
denote the solution to $f(\theta) = 0$.

2.2 General Likelihood based Framework 

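Let $(X_1^T, Y_1)^T, \ldots, (X_n^T, Y_n)^T$ be n i.i.d. copies of the random vector $(X^T, Y)^T$, whose realizations
take values in $\mathbb{R}^d \times \mathcal{Y}$. Write the collection of these n i.i.d. random couples as $\mathcal{D} =
\{(X_1^T, Y_1)^T, \ldots, (X_n^T, Y_n)^T\}$ with $Y = (Y_1, \ldots, Y_n)^T$ and $X = (X_1, \ldots, X_n)^T \in \mathbb{R}^{n \times d}$. Conditional
on $X_i$, we assume $Y_i$ is distributed as $F_{\beta^*}$ for all $i \in \{1, \ldots, n\}$, where $F_{\beta^*}$ has a density or
mass function $f_{\beta^*}$. We thus define the negative log-likelihood function, $\ell_n(\beta)$, as

$$\ell_n(\beta) = \frac{1}{n} \sum_{i=1}^n \ell_i(\beta) = -\frac{1}{n} \sum_{i=1}^n \log f_\beta(Y_i \mid X_i). \tag{2.1}$$

We use $J^* = J(\beta^*)$ to denote the information matrix and $\Theta^*$ to denote $(J^*)^{-1}$, where $J(\beta) =
\mathbb{E}[\nabla^2_{\beta\beta} \ell_n(\beta)]$.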

For testing problems, our goal is to test $H_0 : \beta^*_v = \beta^H_v$ for any $v \in \{1, \ldots, d\}$. We partition
$\beta^*$ as $\beta^* = (\beta^*_v, \beta^{*T}_{-v})^T \in \mathbb{R}^d$, where $\beta^*_{-v} \in \mathbb{R}^{d-1}$ is a vector of nuisance parameters and $\beta^*_v$ is the
parameter of interest. To handle the curse of dimensionality, we exploit a penalized M-estimator
defined as

$$\hat\beta^\lambda = \operatorname*{argmin}_{\beta} \{\ell_n(\beta) + P_\lambda(\beta)\}, \tag{2.2}$$

with $P_\lambda(\beta)$ a sparsity inducing penalty function with a regularization parameter $\lambda$. Examples of
$P_\lambda(\beta)$ include the convex $\ell_1$ penalty, $P_\lambda(\beta) = \lambda \|\beta\|_1 = \lambda \sum_{v=1}^d |\beta_v|$ which, in the context of the
linear model, gives rise to the LASSO estimator (Tibshirani, 1996)

$$\hat\beta^\lambda_{\mathrm{LASSO}} = \operatorname*{argmin}_{\beta} \Big\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \Big\}. \tag{2.3}$$
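
As an illustration of the LASSO objective in (2.3), the following is a minimal sketch of a cyclic coordinate descent solver. It is not taken from the paper: the function names `soft_threshold` and `lasso_cd`, the fixed number of sweeps, and the zero starting point are assumptions made for this example, and the objective is taken to be $(2n)^{-1}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, Y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2n))||Y - X b||_2^2 + lam * ||b||_1.

    Hypothetical helper for illustration; a fixed number of sweeps is used
    instead of a convergence check.
    """
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n          # x_j^T x_j / n for each column
    resid = Y - X @ beta
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]         # partial residual without coordinate j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]         # restore the full residual
    return beta
```

With `lam = 0` and a full column rank design the iteration approaches the least squares solution, while a sufficiently large `lam` returns the zero vector.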

Other penalties include folded concave penalties such as the smoothly clipped absolute deviation
(SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010), which
eliminate the estimation bias and attain the oracle rates of convergence (Loh and Wainwright,
2013; Wang et al., 2014a). The SCAD penalty is defined as

$$P_\lambda(\beta) = \sum_{v=1}^d p_\lambda(\beta_v), \quad \text{where } p_\lambda(t) = \int_0^{|t|} \Big\{ \lambda \mathbb{1}(z \le \lambda) + \frac{(a\lambda - z)_+}{a - 1} \mathbb{1}(z > \lambda) \Big\} \, dz, \tag{2.4}$$

for a given parameter a > 0, and the MCP penalty is given by

$$P_\lambda(\beta) = \sum_{v=1}^d p_\lambda(\beta_v), \quad \text{where } p_\lambda(t) = \lambda \int_0^{|t|} \Big( 1 - \frac{z}{\lambda b} \Big)_+ dz, \tag{2.5}$$

where b > 0 is a fixed parameter. The only requirement we have on $P_\lambda(\beta)$ is that it induces an
estimator satisfying the following condition.

Condition 2.1. For any $\delta \in (0, 1)$, if $\lambda \asymp \sqrt{\log(d/\delta)/n}$,

$$\mathbb{P}\big( \|\hat\beta^\lambda - \beta^*\|_1 > C s n^{-1/2} \sqrt{\log(d/\delta)} \big) \le \delta, \tag{2.6}$$

where s is the sparsity of $\beta^*$, i.e., $s = \|\beta^*\|_0$.

Condition 2.1 holds for the LASSO, SCAD and MCP. See Bühlmann and van de Geer (2011);
Fan and Li (2001); Zhang (2010) respectively and Zhang and Zhang (2012).

The DC algorithm randomly and evenly partitions $\mathcal{D}$ into k disjoint subsets $\mathcal{D}_1, \ldots, \mathcal{D}_k$, so
that $\cup_{j=1}^k \mathcal{D}_j = \mathcal{D}$, $\mathcal{D}_j \cap \mathcal{D}_\ell = \emptyset$ for all $j \neq \ell \in \{1, \ldots, k\}$, and $|\mathcal{D}_1| = |\mathcal{D}_2| = \cdots = |\mathcal{D}_k| = n_k =
n/k$, where it is implicitly assumed that n can be divided evenly. Let $I_j \subset \{1, \ldots, n\}$ be the
index set corresponding to the elements of $\mathcal{D}_j$. Then for an arbitrary $n \times d$ matrix A, $A^{(j)} =
[A_{i\ell}]_{i \in I_j, 1 \le \ell \le d}$. For an arbitrary estimator $\hat\tau$, we write $\hat\tau(\mathcal{D}_j)$ when the estimator is constructed based
only on $\mathcal{D}_j$. What gets lost in this process is the interaction of data across subsamples.
Taking ordinary least-squares regression as an example, the cross-covariances between
subsamples cannot be recovered. However, they do not contain much information
about the unknown parameters, as the subsamples are nearly independent. Finally, we write
$\ell^{(j)}_{n_k}(\beta) = n_k^{-1} \sum_{i \in I_j} \ell_i(\beta)$ to denote the negative log-likelihood function of equation (2.1) based on $\mathcal{D}_j$.
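
The random, even partition described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `partition_indices`, the zero-based indexing, and the seeded shuffle are choices made for this example and are not prescribed by the paper.

```python
import random


def partition_indices(n, k, seed=0):
    """Randomly split {0, ..., n-1} into k disjoint index sets of size n // k.

    Hypothetical helper: mirrors the assumption that k divides n evenly.
    """
    if n % k != 0:
        raise ValueError("n must be divisible by k")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random assignment to subsamples
    n_k = n // k
    return [sorted(indices[j * n_k:(j + 1) * n_k]) for j in range(k)]


# Example: 12 observations split across 3 machines, 4 observations each.
subsets = partition_indices(n=12, k=3)
```

Each returned list plays the role of an index set for one subsample, so the rows of a design matrix held on machine j would be those indexed by `subsets[j]`.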

While the results of this paper hold in a general likelihood based framework, for simplicity we 
state conditions at the population level for the generalized linear model (GLM) with canonical link. 
A much more general set of statements appear in the auxiliary lemmas upon which our main results 
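are based. Under GLM with the canonical link, the response follows the distribution

$$f_n(Y; X, \beta^*) = \prod_{i=1}^n f(Y_i; \eta_i^*) = \prod_{i=1}^n \Big\{ c(Y_i) \exp\Big[ \frac{Y_i \eta_i^* - b(\eta_i^*)}{\phi} \Big] \Big\}, \tag{2.7}$$

where $\eta_i^* = X_i^T \beta^*$. The negative log-likelihood corresponding to (2.7) is given, up to an affine
transformation, by

$$\ell_n(\beta) = \frac{1}{n} \sum_{i=1}^n \big\{ -Y_i X_i^T \beta + b(X_i^T \beta) \big\} = \frac{1}{n} \sum_{i=1}^n \big\{ -Y_i \eta_i + b(\eta_i) \big\} = \frac{1}{n} \sum_{i=1}^n \ell_i(\beta), \tag{2.8}$$

and the gradient and Hessian of $\ell_n(\beta)$ are respectively

$$\nabla \ell_n(\beta) = -\frac{1}{n} X^T \big( Y - \mu(X\beta) \big) \quad \text{and} \quad \nabla^2 \ell_n(\beta) = \frac{1}{n} X^T D(X\beta) X,$$

where $\mu(X\beta) = (b'(\eta_1), \ldots, b'(\eta_n))^T$ and $D(X\beta) = \mathrm{diag}\{b''(\eta_1), \ldots, b''(\eta_n)\}$. In this setting, $J(\beta) =
\mathbb{E}[b''(X_1^T \beta) X_1 X_1^T]$ and $J^* = \mathbb{E}[b''(X_1^T \beta^*) X_1 X_1^T]$.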
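
To make the canonical-link GLM quantities concrete, here is a minimal sketch of the negative log-likelihood, its gradient and its Hessian, written for a generic cumulant function b and instantiated for logistic regression, where $b(\eta) = \log(1 + e^\eta)$. The function names and the choice of the logistic example are assumptions for illustration; the scaling follows the $1/n$ normalization used for the negative log-likelihood in this section.

```python
import numpy as np


def glm_nll(beta, X, Y, b_fn):
    """Negative log-likelihood (1/n) * sum_i { -Y_i * eta_i + b(eta_i) }."""
    eta = X @ beta
    return np.mean(-Y * eta + b_fn(eta))


def glm_grad(beta, X, Y, b1_fn):
    """Gradient -(1/n) X^T (Y - mu), with mu_i = b'(eta_i)."""
    n = X.shape[0]
    return -X.T @ (Y - b1_fn(X @ beta)) / n


def glm_hess(beta, X, b2_fn):
    """Hessian (1/n) X^T D X, with D = diag{ b''(eta_i) }."""
    n = X.shape[0]
    weights = b2_fn(X @ beta)
    return X.T @ (weights[:, None] * X) / n


# Logistic regression as one canonical-link example (an illustrative choice).
def logistic_b(eta):
    return np.logaddexp(0.0, eta)              # log(1 + exp(eta)), numerically stable


def logistic_b1(eta):
    return 1.0 / (1.0 + np.exp(-eta))          # b'(eta)


def logistic_b2(eta):
    p = logistic_b1(eta)
    return p * (1.0 - p)                       # b''(eta)
```

A finite-difference comparison of `glm_nll` against `glm_grad` is a quick way to confirm that the sign conventions agree.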


3 Divide and Conquer Hypothesis Tests 


In the context of the two classical testing frameworks, the Wald and Rao score tests, our objective
is to construct a test statistic $S_n$ with low communication cost and a tractable limiting distribution
F. Let $\beta^*_v$ be the $v$-th component of $\beta^*$. From this statistic we define a test of size $\alpha$ of the null
hypothesis, $H_0 : \beta^*_v = \beta^H_v$, against the alternative, $H_1 : \beta^*_v \neq \beta^H_v$, as a partition of the sample space
described by

$$\begin{cases} 0 & \text{if } |S_n| \le F^{-1}(1 - \alpha/2), \\ 1 & \text{if } |S_n| > F^{-1}(1 - \alpha/2), \end{cases} \tag{3.1}$$

for a two sided test.
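
The two sided decision rule can be sketched in a few lines for the case where the limiting distribution F is standard normal, which is the case arising in the results that follow. The function name `two_sided_test` and the default level are assumptions made for this example.

```python
from statistics import NormalDist


def two_sided_test(s_n, alpha=0.05, dist=NormalDist()):
    """Return 1 (reject H0) if |s_n| exceeds the (1 - alpha/2) quantile, else 0."""
    critical = dist.inv_cdf(1 - alpha / 2)     # F^{-1}(1 - alpha/2)
    return 1 if abs(s_n) > critical else 0
```

For a different limit distribution, any object exposing an `inv_cdf` method could be passed as `dist`.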


3.1 Two Divide and Conquer Wald Type Constructions 

For the high dimensional linear model, Zhang and Zhang (2014), van de Geer et al. (2014) and 
Javanmard and Montanari (2014) propose methods for debiasing the LASSO estimator with a 
view to constructing high dimensional analogues of Wald statistics and confidence intervals for 
low-dimensional coordinates. As pointed out by Zhang and Zhang (2014), the debiased estimator 
does not impose the minimum signal condition used in establishing oracle properties of regularized 
estimators (Fan and Li, 2001; Fan and Lv, 2011; Loh and Wainwright, 2015; Wang et al., 2014b;
Zhang and Zhang, 2012) and hence has wider applicability than those inferences based on the oracle 
properties. The method of van de Geer et al. (2014) is appealing in that it accommodates a general 
penalized likelihood based framework, while the Javanmard and Montanari (2014) approach is 
appealing in that it optimizes asymptotic variance and requires a weaker condition than van de Geer 
et al. (2014) in the specific case of the linear model. We consider the DC analogues of Javanmard
and Montanari (2014) and van de Geer et al. (2014) in Sections 3.1.1 and 3.1.2 respectively. 

3.1.1 LASSO based Wald Test for the Linear Model 

The linear model assumes

$$Y_i = X_i^T \beta^* + \varepsilon_i, \tag{3.2}$$

where $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. with mean zero and variance $\sigma^2$. For concreteness, we focus on a LASSO
based method, but our procedure is also valid when other pilot estimators are used. We describe
a modification of the bias correction method introduced in Javanmard and Montanari (2014) as a
means of testing hypotheses on low dimensional coordinates of $\beta^*$ via pivotal test statistics.

On each subset $\mathcal{D}_j$, we compute the debiased estimator of $\beta^*$ as in Javanmard and Montanari
(2014) as

$$\hat\beta^d(\mathcal{D}_j) = \hat\beta^\lambda_{\mathrm{LASSO}}(\mathcal{D}_j) + \frac{1}{n_k} M^{(j)} (X^{(j)})^T \big( Y^{(j)} - X^{(j)} \hat\beta^\lambda_{\mathrm{LASSO}}(\mathcal{D}_j) \big), \tag{3.3}$$

where the superscript d is used to indicate the debiased version of the estimator, $M^{(j)} = (m_1^{(j)}, \ldots, m_d^{(j)})^T$
and $m_v^{(j)}$ is the solution of

$$m_v^{(j)} = \operatorname*{argmin}_{m} \; m^T \hat\Sigma^{(j)} m \quad \text{s.t.} \quad \|\hat\Sigma^{(j)} m - e_v\|_\infty \le \vartheta_1, \quad \|X^{(j)} m\|_\infty \le \vartheta_2, \tag{3.4}$$

where the choice of tuning parameters $\vartheta_1$ and $\vartheta_2$ is discussed in Javanmard and Montanari (2014)
and Zhao et al. (2014a). Above, $\hat\Sigma^{(j)} = n_k^{-1} \sum_{i \in I_j} X_i X_i^T$ is the sample covariance based
on $\mathcal{D}_j$, whose population counterpart is $\Sigma = \mathbb{E}(X_1 X_1^T)$, and $M^{(j)}$ is its regularized inverse. The
second term in (3.3) is a bias correction term, while $\sigma^2 m_v^{(j)T} \hat\Sigma^{(j)} m_v^{(j)} / n_k$ is shown in Javanmard
and Montanari (2014) to be the variance of the $v$-th component of $\hat\beta^d(\mathcal{D}_j)$. The parameter $\vartheta_1$,
which tends to zero, controls the bias of the debiased estimator (3.3) and the optimization in (3.4)
minimizes the variance of the resulting estimator.

Solving the d optimization problems in (3.4) increases the computational complexity by an order of
magnitude even for k = 1. Thus, it is necessary to appeal to the divide and conquer strategy to reduce
the computational burden. This gives rise to the question of how large k can be while maintaining
the same statistical properties as the whole sample procedure (k = 1).

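Because our DC procedure gives rise to smaller samples, we overcome the singularity in $\hat\Sigma^{(j)}$
through a change of variables. More specifically, noting that $M^{(j)}$ is not required explicitly, but
rather the product $M^{(j)} (X^{(j)})^T$, we propose

$$b_v^{(j)} = \operatorname*{argmin}_{b} \; \frac{b^T b}{n_k} \quad \text{s.t.} \quad \Big\| \frac{(X^{(j)})^T b}{n_k} - e_v \Big\|_\infty \le \vartheta_1, \quad \|b\|_\infty \le \vartheta_2,$$

from which we construct $M^{(j)} (X^{(j)})^T = B^T$, where $B = (b_1, \ldots, b_d)$.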
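
The change of variables can be checked numerically. The sketch below assumes the substitution $b = X^{(j)} m$ (our reading of the construction above) and verifies that the objective $m^T \hat\Sigma^{(j)} m$ and both constraint quantities of the original program coincide with their counterparts in the reparametrized program, even when $n_k < d$ and the sample covariance is singular. The function name and the simulated data are assumptions for illustration only.

```python
import numpy as np


def check_change_of_variables(X, m, v):
    """Compare the program in m with the program in b = X m on one subsample.

    Hypothetical helper: X is an (n_k x d) subsample design, m a candidate
    direction, v the coordinate of interest.
    """
    n_k, d = X.shape
    sigma_hat = X.T @ X / n_k                  # sample covariance, singular if n_k < d
    e_v = np.zeros(d)
    e_v[v] = 1.0
    b = X @ m                                  # assumed change of variables
    return {
        "objective": (m @ sigma_hat @ m, b @ b / n_k),
        "bias_constraint": (
            np.max(np.abs(sigma_hat @ m - e_v)),
            np.max(np.abs(X.T @ b / n_k - e_v)),
        ),
        "sup_constraint": (np.max(np.abs(X @ m)), np.max(np.abs(b))),
    }


rng = np.random.default_rng(0)
X_sub = rng.standard_normal((8, 20))           # n_k = 8 < d = 20
pairs = check_change_of_variables(X_sub, rng.standard_normal(20), v=3)
```

Each entry of `pairs` holds the value computed in terms of m next to the value computed in terms of b; the two agree up to floating point error.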

The following conditions on the data generating process and the tail behavior of the design 
vectors are imposed in Javanmard and Montanari (2014). Both conditions are used to derive the 
theoretical properties of the DC Wald test statistic based on the aggregated debiased estimator. 

Condition 3.1. $\{(Y_i, X_i)\}_{i=1}^n$ are i.i.d. and $\Sigma$ satisfies $0 < C_{\min} \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le C_{\max}$.

Condition 3.2. The rows of X are sub-Gaussian with $\|X_i\|_{\psi_2} \le \kappa$, $i = 1, \ldots, n$.

Note that under the two conditions above, there exists a constant $\kappa_1 > 0$ such that $\|X_1^T \Sigma^{-1/2}\|_{\psi_2} \le
\kappa_1$. Without loss of generality, we can set $\kappa_1 = \kappa$. Our first main theorem provides the relative
scaling of the various tuning parameters involved in the construction of $\bar\beta^d$.


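Theorem 3.3. Suppose Conditions 2.1, 3.1 and 3.2 are fulfilled. Suppose $\mathbb{E}[\varepsilon_1^4] < \infty$ and choose
$\vartheta_1$, $\vartheta_2$ and k such that $\vartheta_1 \asymp \sqrt{k \log d / n}$, $\vartheta_2 n^{-1/2} = o(1)$ and $k = o((s \log d)^{-1} \sqrt{n})$. For any
$v \in \{1, \ldots, d\}$,

$$\frac{\sqrt{n} (\bar\beta^d_v - \beta^*_v)}{\bar Q_v} \rightsquigarrow N(0, \sigma^2), \tag{3.5}$$

where $\bar Q_v = \big( k^{-1} \sum_{j=1}^k m_v^{(j)T} \hat\Sigma^{(j)} m_v^{(j)} \big)^{1/2}$.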

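Theorem 3.3 entertains the prospect of a divide and conquer Wald statistic of the form

$$S_n = \frac{\sqrt{n} (\bar\beta^d_v - \beta^H_v)}{\bar\sigma \bar Q_v} \tag{3.6}$$

for $\beta^*_v$, where $\bar\sigma$ is an estimator for $\sigma$ based on the k subsamples. On the left hand side of equation
(3.6) we suppress the dependence on v to simplify notation. As an estimator for $\sigma$, a simple
suggestion with the same computational complexity is $\bar\sigma$, where

$$\bar\sigma^2 = \frac{1}{k} \sum_{j=1}^k \hat\sigma^2(\mathcal{D}_j) \quad \text{and} \quad \hat\sigma^2(\mathcal{D}_j) = \frac{1}{n_k} \sum_{i \in I_j} \big( Y_i - X_i^T \hat\beta^\lambda_{\mathrm{LASSO}}(\mathcal{D}_j) \big)^2. \tag{3.7}$$

One can use the refitted cross-validation procedure of Fan et al. (2012) to reduce the bias of the
estimate. In Lemma 3.4 we show that with the scaling of k and $\lambda$ required for the weak convergence
results of Theorem 3.3, consistency of $\bar\sigma^2$ is also achieved.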

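Lemma 3.4. Suppose $\mathbb{E}[\varepsilon_i \mid X_i] = 0$ for all $i \in \{1, \ldots, n\}$. Then with $\lambda \asymp \sqrt{k \log d / n}$ and $k =
o(\sqrt{n} (s \log d)^{-1})$, $|\bar\sigma^2 - \sigma^2| = o_P(1)$.

With Lemma 3.4 and Theorem 3.3 at hand, we establish in Corollary 3.5 the asymptotic distribution
of $S_n$ under the null hypothesis $H_0 : \beta^*_v = \beta^H_v$. This holds for each component $v \in \{1, \ldots, d\}$.

Corollary 3.5. Suppose Conditions 3.1 and 3.2 are fulfilled, $\mathbb{E}[\varepsilon_1^4] < \infty$, and $\lambda$, $\vartheta_1$ and $\vartheta_2$ are chosen
as $\lambda \asymp \sqrt{k \log d / n}$, $\vartheta_1 \asymp \sqrt{k \log d / n}$ and $\vartheta_2 n^{-1/2} = o(1)$. Then provided $k = o((s \log d)^{-1} \sqrt{n})$,
under $H_0 : \beta^*_v = \beta^H_v$, we have

$$\lim_{n \to \infty} \sup_{t \in \mathbb{R}} \big| \mathbb{P}(S_n \le t) - \Phi(t) \big| = 0, \tag{3.8}$$

where $\Phi(\cdot)$ is the cdf of a standard normal distribution.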
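
The aggregation step behind the divide and conquer Wald statistic can be sketched as below. The sketch assumes that each machine reports three scalars for the coordinate of interest: its debiased estimate, its variance factor $m_v^{(j)T} \hat\Sigma^{(j)} m_v^{(j)}$, and its noise variance estimate, and that these are combined by simple averaging as in this section. The function name and argument layout are hypothetical.

```python
import math


def dc_wald_statistic(beta_d, q_sq, sigma_sq, beta_null, n):
    """Aggregate per-subsample summaries into a Wald-type statistic.

    beta_d[j]   : debiased estimate of the coordinate on subsample j
    q_sq[j]     : variance factor m^T Sigma_hat m on subsample j
    sigma_sq[j] : noise variance estimate on subsample j
    beta_null   : hypothesised value of the coordinate
    n           : total sample size
    """
    k = len(beta_d)
    beta_bar = sum(beta_d) / k                 # aggregated debiased estimate
    q_bar = math.sqrt(sum(q_sq) / k)           # averaged variance factor
    sigma_bar = math.sqrt(sum(sigma_sq) / k)   # averaged noise level
    return math.sqrt(n) * (beta_bar - beta_null) / (sigma_bar * q_bar)
```

Only three scalars per machine are communicated, which is the sense in which the construction is communication efficient; the returned value would then be compared with a standard normal quantile.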


3.1.2 Wald Test in the Likelihood Based Framework 

An alternative route to debiasing the LASSO estimator of $\beta^*$ is the one proposed in van de Geer
et al. (2014). Their so called desparsified estimator of $\beta^*$ is more general than the debiased estimator
of Javanmard and Montanari (2014) in that it accommodates generic estimators of the form (2.2) as
pilot estimators, but the latter optimizes the variance of the resulting estimator. The desparsified
estimator for subsample $\mathcal{D}_j$ is

$$\hat\beta^d(\mathcal{D}_j) = \hat\beta^\lambda(\mathcal{D}_j) - \hat\Theta^{(j)} \nabla \ell^{(j)}_{n_k}\big( \hat\beta^\lambda(\mathcal{D}_j) \big), \tag{3.9}$$

where $\hat\Theta^{(j)}$ is a regularized inverse of the Hessian matrix of second order derivatives of $\ell^{(j)}_{n_k}(\beta)$ at
$\hat\beta^\lambda(\mathcal{D}_j)$, denoted by $\hat J^{(j)} = \nabla^2 \ell^{(j)}_{n_k}(\hat\beta^\lambda(\mathcal{D}_j))$. We will make this explicit in due course. The estimator
resembles the classical one-step estimator (Bickel, 1975), but now in the high-dimensional setting
via a regularized inverse of the Hessian matrix $\hat J^{(j)}$, which reduces to the empirical covariance of
the design matrix in the case of the linear model. From equation (3.9), the aggregated debiased
estimator over the k subsamples is defined as $\bar\beta^d = k^{-1} \sum_{j=1}^k \hat\beta^d(\mathcal{D}_j)$.
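
A one-step correction of this kind, followed by averaging over subsamples, can be sketched for the linear model as follows. The sketch assumes the squared-error loss $(2n_k)^{-1}\|Y - X\beta\|_2^2$ on each subsample, so that the gradient is $-X^T(Y - X\beta)/n_k$, and takes the approximate inverse Hessian as a given input; how that inverse is regularized is not addressed here. The names `one_step_linear` and `aggregate` are hypothetical.

```python
import numpy as np


def one_step_linear(beta_pilot, X, Y, theta_hat):
    """One-step correction: pilot estimate minus theta_hat times the loss gradient.

    Assumes the loss (1/(2 n_k)) ||Y - X beta||_2^2 on the subsample (X, Y).
    """
    n_k = X.shape[0]
    grad = -X.T @ (Y - X @ beta_pilot) / n_k
    return beta_pilot - theta_hat @ grad


def aggregate(estimates):
    """Average the subsample estimates coordinate by coordinate."""
    return np.mean(np.stack(estimates), axis=0)
```

In a low dimensional setting where `theta_hat` is the exact inverse of $X^T X / n_k$, the one-step correction recovers the least squares estimate from any pilot value, which gives a convenient sanity check of the sign convention.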

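We now use the nodewise LASSO (Meinshausen and Bühlmann, 2006) to approximately invert
$\hat J^{(j)}$ via $L_1$-regularization. The basic idea is to find the regularized inverse row by row via a penalized
$L_1$-regression, which is the same as regressing the variable $X_v$ on $X_{-v}$ but expressed in the sample
covariance form. For each row $v \in \{1, \ldots, d\}$, consider the optimization

$$\hat\kappa_v(\mathcal{D}_j) = \operatorname*{argmin}_{\kappa \in \mathbb{R}^{d-1}} \big( \hat J^{(j)}_{vv} - 2 \hat J^{(j)}_{v,-v} \kappa + \kappa^T \hat J^{(j)}_{-v,-v} \kappa + 2 \lambda_v \|\kappa\|_1 \big), \tag{3.10}$$

where $\hat J^{(j)}_{v,-v}$ denotes the $v$-th row of $\hat J^{(j)}$ without the $(v, v)$-th diagonal element, and $\hat J^{(j)}_{-v,-v}$ is the
principal submatrix without the $v$-th row and $v$-th column. Introduce

$$\hat C^{(j)} := \begin{pmatrix} 1 & -\hat\kappa_{1,2}(\mathcal{D}_j) & \cdots & -\hat\kappa_{1,d}(\mathcal{D}_j) \\ -\hat\kappa_{2,1}(\mathcal{D}_j) & 1 & \cdots & -\hat\kappa_{2,d}(\mathcal{D}_j) \\ \vdots & \vdots & \ddots & \vdots \\ -\hat\kappa_{d,1}(\mathcal{D}_j) & -\hat\kappa_{d,2}(\mathcal{D}_j) & \cdots & 1 \end{pmatrix} \tag{3.11}$$

and $\hat\Xi^{(j)} = \mathrm{diag}(\hat\tau_1(\mathcal{D}_j), \ldots, \hat\tau_d(\mathcal{D}_j))$, where $\hat\tau_v(\mathcal{D}_j)^2 = \hat J^{(j)}_{vv} - \hat J^{(j)}_{v,-v} \hat\kappa_v(\mathcal{D}_j)$. $\hat\Theta^{(j)}$ in equation (3.9) is
given by

$$\hat\Theta^{(j)} = (\hat\Xi^{(j)})^{-2} \hat C^{(j)}, \tag{3.12}$$

and we define $\hat\Theta^{(j)}_v$ as the transposed $v$-th row of $\hat\Theta^{(j)}$.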
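
The assembly of the approximate inverse from the nodewise regression coefficients can be sketched as below. The sketch takes the coefficient vectors as given inputs (how they are computed, including the penalty level, is outside its scope) and builds the matrix with unit diagonal and negated coefficients off the diagonal, the residual variances, and their scaled product. The function name `assemble_theta` is hypothetical.

```python
import numpy as np


def assemble_theta(J, kappas):
    """Build the approximate inverse of J from nodewise coefficients.

    J      : (d x d) symmetric matrix (e.g. a subsample Hessian)
    kappas : list of d vectors of length d - 1; kappas[v] holds the
             coefficients of coordinate v regressed on the others
    """
    d = J.shape[0]
    C = np.eye(d)
    tau_sq = np.empty(d)
    for v in range(d):
        others = [u for u in range(d) if u != v]
        kappa_v = np.asarray(kappas[v], dtype=float)
        C[v, others] = -kappa_v                            # row v of C
        tau_sq[v] = J[v, v] - J[v, others] @ kappa_v       # residual variance
    return C / tau_sq[:, None]                             # row v scaled by 1 / tau_v^2
```

When the coefficients are the unpenalized solutions $J_{-v,-v}^{-1} J_{-v,v}$ and J is invertible, the output coincides with the exact inverse of J, which is the property the penalized version approximates in high dimensions.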

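Theorem 3.8 establishes the limit distribution of the term

$$S_n = \sqrt{n} \, \frac{1}{k} \sum_{j=1}^k \frac{\hat\beta^d_v(\mathcal{D}_j) - \beta^H_v}{\sqrt{\Theta^*_{vv}}} \tag{3.13}$$
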
for any $v \in \{1, \ldots, d\}$ under the null hypothesis $H_0 : \beta^*_v = \beta^H_v$. This provides the basis for the
statistical inference based on divide-and-conquer. We need the following condition. Recall that
$J^* = \mathbb{E}[\nabla^2_{\beta\beta} \ell_n(\beta^*)]$ and consider the generalized linear model (2.7).

Condition 3.6. (i) $\{(Y_i, X_i)\}_{i=1}^n$ are i.i.d., $0 < C_{\min} \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le C_{\max}$, $\lambda_{\min}(J^*) \ge
L_{\min} > 0$, $\|J^*\|_{\max} < U_1 < \infty$. (ii) For some constant $M < \infty$, $\max_{1 \le i \le n} |X_i^T \beta^*| \le M$ and
$\max_{1 \le i \le n} \|X_i\|_\infty \le M$. (iii) There exist finite constants $U_2, U_3 > 0$ such that $b''(\eta) < U_2$ and
$b'''(\eta) < U_3$ for all $\eta \in \mathbb{R}$.

The same assumptions appear in van de Geer et al. (2014). In the case of the Gaussian GLM,
the condition on $\lambda_{\min}(J^*)$ reduces to the requirement that the covariance of the design has minimal
eigenvalue bounded away from zero, which is a standard assumption. We require $\|J^*\|_{\max} < \infty$ to
control the estimation error of different functionals of $J^*$. The restrictions in (ii) on the covariates
and the projection of the covariates are imposed for technical simplicity; they can be extended to the
case of exponential tails (see Fan and Song, 2010). Note that $\mathrm{Var}(Y_i) = \phi b''(X_i^T \beta^*)$, where $\phi$ is the
dispersion parameter in (2.7), so $b''(\eta) < U_2$ essentially implies an upper bound on the variance
of the response. In fact, Lemma A.2 shows that $b''(\eta) < U_2$ can guarantee that the response is
sub-Gaussian. $b'''(\eta) < U_3$ is used to derive the Lipschitz property of $b''(X_i^T \beta)$ with respect to $\beta$ as
shown in Lemma A.5. We emphasize that no requirement in Condition 3.6 is specific to the divide
and conquer framework.

The assumption of bounded design in (ii) can be relaxed to a sub-Gaussian design. However,
the price to pay is that the allowable number of subsets k is smaller than in the bounded case, which
means we need a larger subsample size. To be more precise, the order of the maximum k for the
sub-Gaussian design has an extra factor, which is a polynomial of $\sqrt{\log d}$, compared to the order for
the bounded design. This logarithmic factor comes from different Lipschitz properties of $b''(X_i^T \beta)$ in
the two designs, which is fully explained in Lemma A.5 of the appendix. In the following theorems,
we only present results for the case of bounded design for technical simplicity.

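In addition, recalling that $\Theta^* = (J^*)^{-1}$, where $J^* := J(\beta^*) = \mathbb{E}[\nabla^2_{\beta\beta} \ell_n(\beta^*)]$, we impose
Condition 3.7 on $\Theta^*$ and its estimator $\hat\Theta$.

Condition 3.7. (i) $\min_{1 \le v \le d} \Theta^*_{vv} \ge \theta_{\min} > 0$. (ii) $\max_{1 \le i \le n} \|X_i^T \Theta^*\|_\infty \le M$. (iii) For $v =
1, \ldots, d$, whenever $\lambda_v \asymp \sqrt{k \log d / n}$ in (3.10), we have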

P (^11©^ - 0 *||i > Csiv^log d/n^ < 

where C is a constant and $s_1$ is such that $\|\Theta^*_v\|_0 \lesssim s_1$ for all $v \in \{1, \ldots, d\}$.

Part (i) of Condition 3.7 ensures that the variances of each component of the debiased estimator
exist, guaranteeing the existence of the Wald statistic. Parts (ii) and (iii) are imposed directly for
technical simplicity. Results of this nature have been established under a similar set of assumptions
in van de Geer et al. (2014) and Negahban et al. (2009) for convex penalties and in Wang et al.
(2014a) and Loh and Wainwright (2015) for folded concave penalties.

As a step towards deriving the limit distribution of the proposed divide and conquer Wald
statistic in the GLM framework, we establish the asymptotic behavior of the aggregated debiased
estimator $\bar\beta^d_v$ for every given $v \in [d]$.

Theorem 3.8. Under Conditions 2.1, 3.6 and 3.7, with $\lambda \asymp \sqrt{k \log d / n}$, we have

$$\bar\beta^d_v - \beta^*_v = -\frac{1}{k} \sum_{j=1}^k \Theta^{*T}_v \nabla \ell^{(j)}_{n_k}(\beta^*) + o_P(n^{-1/2}) \tag{3.14}$$

for any $k \ll d$ satisfying $k = o(((s \vee s_1) \log d)^{-1} \sqrt{n})$, where $\Theta^*_v$ is the transposed $v$-th row of $\Theta^*$.

A corollary of Theorem 3.8 provides the asymptotic distribution of the Wald statistic in equation
(3.13) under the null hypothesis.

Corollary 3.9. Let $S_n$ be as in equation (3.13), with $\Theta^*_{vv}$ replaced with an estimator $\bar\Theta_{vv}$. Then
under the conditions of Theorem 3.8 and $H_0 : \beta^*_v = \beta^H_v$, provided $|\bar\Theta_{vv} - \Theta^*_{vv}| = o_P(1)$ under the
scaling $k = o(((s \vee s_1) \log d)^{-1} \sqrt{n})$, we have

$$\lim_{n \to \infty} \sup_{t \in \mathbb{R}} \big| \mathbb{P}(S_n \le t) - \Phi(t) \big| = 0.$$

Remark 3.10. Although Theorem 3.8 and Corollary 3.9 are stated only for the GLM, their proofs
are in fact an application of two more general results. Further details are available in Lemmas A.7
and A.8 of the appendix.

We return to the issue of estimating $\Theta^*_{vv}$ in Section 4, where we introduce a consistent estimator
of $\Theta^*_{vv}$ that preserves the scaling of Theorem 3.8 and Corollary 3.9.


3.2 Divide and Conquer Score Test 


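In this section, we use $\nabla_v f(\beta)$ and $\nabla_{-v} f(\beta)$ to denote, respectively, the partial derivative of f
with respect to $\beta_v$ and the partial derivative vector of f with respect to $\beta_{-v}$. $\nabla^2_{vv} f(\beta)$, $\nabla^2_{v,-v} f(\beta)$,
$\nabla^2_{-v,v} f(\beta)$ and $\nabla^2_{-v,-v} f(\beta)$ are analogously defined.

In the low dimensional setting (where d is fixed), Rao’s score test of $H_0 : \beta^*_v = \beta^H_v$ against
$H_1 : \beta^*_v \neq \beta^H_v$ is based on $\nabla_v \ell_n(\beta^H_v, \tilde\beta_{-v})$, where $\tilde\beta_{-v}$ is a constrained maximum likelihood estimator
of $\beta^*_{-v}$, constructed as $\tilde\beta_{-v} = \operatorname{argmin}_{\beta_{-v}} \ell_n(\beta^H_v, \beta_{-v}) = \operatorname{argmax}_{\beta_{-v}} \{-\ell_n(\beta^H_v, \beta_{-v})\}$. If $H_0$ is false,
imposing the constraint postulated by $H_0$ significantly violates the first order conditions from M-estimation
with high probability; this is the principle underpinning the classical score test. Under
regularity conditions, it can be shown (e.g. Cox and Hinkley, 1974) that

$$\sqrt{n} \, \nabla_v \ell_n(\beta^H_v, \tilde\beta_{-v}) \, (J^*_{v|-v})^{-1/2} \rightsquigarrow N(0, 1),$$

where $J^*_{v|-v}$ is given by $J^*_{v|-v} = J^*_{v,v} - J^*_{v,-v} J^{*-1}_{-v,-v} J^*_{-v,v}$, with $J^*_{v,v}$, $J^*_{v,-v}$, $J^*_{-v,-v}$ and $J^*_{-v,v}$ the
partitions of the information matrix $J^* = J(\beta^*)$,

$$J(\beta) = \begin{pmatrix} J_{v,v} & J_{v,-v} \\ J_{-v,v} & J_{-v,-v} \end{pmatrix} = \mathbb{E} \begin{pmatrix} \nabla^2_{v,v} \ell_n(\beta) & \nabla^2_{v,-v} \ell_n(\beta) \\ \nabla^2_{-v,v} \ell_n(\beta) & \nabla^2_{-v,-v} \ell_n(\beta) \end{pmatrix}. \tag{3.15}$$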

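The problems associated with the use of the classical score statistic in the presence of a high
dimensional nuisance parameter are brought to light by Ning and Liu (2014), who propose a remedy
via the decorrelated score. The problem stems from the inversion of the matrix $J^*_{-v,-v}$ in high
dimensions. The decorrelated score is defined as

$$S(\beta^*_v, \beta^*_{-v}) = \nabla_v \ell_n(\beta^*_v, \beta^*_{-v}) - w^{*T} \nabla_{-v} \ell_n(\beta^*_v, \beta^*_{-v}), \quad \text{where } w^{*T} = J^*_{v,-v} J^{*-1}_{-v,-v}. \tag{3.16}$$

For a regularized estimator $\hat w$ of $w^*$, to be defined below, a mean value expansion of

$$\hat S(\beta^*_v, \hat\beta^\lambda_{-v}) := \nabla_v \ell_n(\beta^*_v, \hat\beta^\lambda_{-v}) - \hat w^T \nabla_{-v} \ell_n(\beta^*_v, \hat\beta^\lambda_{-v}) \tag{3.17}$$

around $\beta^*_{-v}$ gives

$$\hat S(\beta^*_v, \hat\beta^\lambda_{-v}) = \nabla_v \ell_n(\beta^*_v, \beta^*_{-v}) - \hat w^T \nabla_{-v} \ell_n(\beta^*_v, \beta^*_{-v}) + \big[ \nabla^2_{v,-v} \ell_n(\beta^*_v, \beta_{-v,\alpha}) - \hat w^T \nabla^2_{-v,-v} \ell_n(\beta^*_v, \beta_{-v,\alpha}) \big] (\hat\beta^\lambda_{-v} - \beta^*_{-v}), \tag{3.18}$$

where $\beta_{-v,\alpha} = \alpha \hat\beta^\lambda_{-v} + (1 - \alpha) \beta^*_{-v}$ for $\alpha \in [0, 1]$. The key to understanding how the decorrelated
score remedies the problems faced by the classical score test is the observation that

$$\mathbb{E}\big[ \nabla^2_{v,-v} \ell_n(\beta^*_v, \beta^*_{-v}) - w^{*T} \nabla^2_{-v,-v} \ell_n(\beta^*_v, \beta^*_{-v}) \big] = J^*_{v,-v} - w^{*T} J^*_{-v,-v} = 0, \tag{3.19}$$

where $w^{*T} = J^*_{v,-v} J^{*-1}_{-v,-v}$. Hence, provided $w^*$ is sufficiently sparse to avoid excessive noise
accumulation, we are able to achieve rate acceleration in equation (3.18), ultimately giving rise to
a tractable limit distribution of a suitable rescaling of $\hat S(\beta^*_v, \hat\beta^\lambda_{-v})$. Since $\beta^*_v$ is restricted under the
null hypothesis, $H_0 : \beta^*_v = \beta^H_v$, the statistic in equation (3.17) is accessible once $H_0$ is imposed. As
Ning and Liu (2014) point out, $w^*$ is the solution to

$$w^* = \operatorname*{argmin}_{w} \; \mathbb{E}\big[ \nabla_v \ell_n(\beta^H_v, \beta^*_{-v}) - w^T \nabla_{-v} \ell_n(\beta^H_v, \beta^*_{-v}) \big]^2$$

under $H_0 : \beta^*_v = \beta^H_v$. We thus see that the population analogue of the decorrelation device is the
linear combination $w^{*T} \nabla_{-v} \ell_n(\beta^H_v, \beta^*_{-v})$ that best approximates $\nabla_v \ell_n(\beta^H_v, \beta^*_{-v})$ in a least squares
sense.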
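Our divide and conquer score statistic under $H_0 : \beta^*_v = \beta^H_v$ is

$$\bar S(\beta^H_v) = \frac{1}{k} \sum_{j=1}^k \hat S^{(j)}\big( \beta^H_v, \hat\beta^\lambda_{-v}(\mathcal{D}_j) \big), \tag{3.20}$$

where $\hat S^{(j)}(\beta^H_v, \hat\beta^\lambda_{-v}(\mathcal{D}_j)) = \nabla_v \ell^{(j)}_{n_k}(\beta^H_v, \hat\beta^\lambda_{-v}(\mathcal{D}_j)) - \hat w(\mathcal{D}_j)^T \nabla_{-v} \ell^{(j)}_{n_k}(\beta^H_v, \hat\beta^\lambda_{-v}(\mathcal{D}_j))$ and

$$\hat w(\mathcal{D}_j) = \operatorname*{argmin}_{w} \|w\|_1, \quad \text{s.t.} \quad \Big\| \nabla^2_{v,-v} \ell^{(j)}_{n_k}\big( \hat\beta^\lambda_v(\mathcal{D}_j), \hat\beta^\lambda_{-v}(\mathcal{D}_j) \big) - w^T \nabla^2_{-v,-v} \ell^{(j)}_{n_k}\big( \hat\beta^\lambda_v(\mathcal{D}_j), \hat\beta^\lambda_{-v}(\mathcal{D}_j) \big) \Big\|_\infty \le \mu. \tag{3.21}$$

Equation (3.21) is the Dantzig selector of Candes and Tao (2007).

Theorem 3.11. Let $\hat J_{v|-v}$ be a consistent estimator of $J^*_{v|-v}$ and

$$S^{(j)}(\beta^*_v, \beta^*_{-v}) = \nabla_v \ell^{(j)}_{n_k}(\beta^*_v, \beta^*_{-v}) - w^{*T} \nabla_{-v} \ell^{(j)}_{n_k}(\beta^*_v, \beta^*_{-v}).$$

Suppose $\|w^*\|_1 \lesssim s_1$ and Conditions 2.1 and 3.6 are fulfilled. Then under $H_0 : \beta^*_v = \beta^H_v$ with
$\lambda \asymp \mu \asymp \sqrt{k \log d / n}$,

$$\sqrt{n} \, \bar S(\beta^H_v) = \sqrt{n} \, \frac{1}{k} \sum_{j=1}^k S^{(j)}(\beta^H_v, \beta^*_{-v}) + o_P(1) \quad \text{and} \quad \lim_{n \to \infty} \sup_{t \in \mathbb{R}} \Big| \mathbb{P}\big( \sqrt{n} \, \bar S(\beta^H_v) \hat J_{v|-v}^{-1/2} \le t \big) - \Phi(t) \Big| = 0,$$

for any $k \ll d$ satisfying $k = o(((s \vee s_1) \log d)^{-1} \sqrt{n})$, where $\bar S(\beta^H_v)$ is defined in equation (3.20).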

Remark 3.12. By the definition of $w^*$ and the block matrix inversion formula for $\Theta^* = (J^*)^{-1}$,
sparsity of $w^*$ is implied by sparsity of $\Theta^*$ as assumed in van de Geer et al. (2014) and Condition
3.7 of Section 3.1.2. In turn, $\|w^*\|_0 \lesssim s_1$ implies $\|w^*\|_1 \lesssim s_1$ provided that the elements of $w^*$ are
bounded.

Remark 3.13. Although Theorem 3.11 is stated in the penalized GLM setting, the result holds
more generally; further details are available in Lemma A.13 of Appendix A in the Supplementary
Material.

To maintain the same computational complexity, an estimator of the conditional information 
needs to be constructed using a DC procedure. For this, we propose to use 

k 

i=i 

where $\bar\beta^d = k^{-1}\sum_{j=1}^{k}\hat\beta^d(D_j)$ and $\bar w = k^{-1}\sum_{j=1}^{k}\hat w(D_j)$. By Lemma
3.14, this estimator is asymptotically consistent.

Lemma 3.14. Suppose $\|w^*\|_1 = O(s_1)$ and Conditions 2.1 and 3.6 are fulfilled. Then for any
$k \ll d$ satisfying $k = o(((s \vee s_1)\log d)^{-1}\sqrt{n})$, $|\bar J_{v|-v} - J^*_{v|-v}| = o_P(1)$.

4 Accuracy of Distributed Estimation 

As explained in Section 2.2, little information is lost in the divide-and-conquer process. This
motivates us to consider $\|\bar\beta^d - \hat\beta^d\|_2$, the loss incurred by the divide and conquer strategy
in comparison with the computationally infeasible full sample debiased estimator $\hat\beta^d$. Indeed, it
turns out that, for $k$ not too large, $\bar\beta^d - \hat\beta^d$ appears only as a higher order term in the decomposition
of $\bar\beta^d - \beta^*$ and thus $\|\bar\beta^d - \hat\beta^d\|_2$ is negligible compared to the statistical error, $\|\hat\beta^d - \beta^*\|_2$. In other
words, the divide-and-conquer errors are statistically negligible.

When the minimum signal strength is sufficiently strong, thresholding $\bar\beta^d$ achieves exact support
recovery, motivating a refitting procedure based on the low dimensional selected variables. As a
means to understanding the theoretical properties of this refitting procedure, as well as for independent
interest, this section develops new theory and methodology for the low dimensional ($d < n$)
linear and generalized linear models in addition to their high dimensional ($d \gg n$) counterparts.
It turns out that simple averaging of low dimensional OLS or GLM estimators (denoted uniformly
as $\hat\beta^{(j)}$, without superscript $d$ as debiasing is not necessary) suffices to preserve the statistical error,
i.e., it achieves the same statistical accuracy as the estimator based on the whole data set. This
is because, in contrast to the high dimensional setting, parameters are not penalized in the low
dimensional case. With $\bar\beta$ the average of $\hat\beta^{(j)}$ over the $k$ machines and $\hat\beta$ the full sample counterpart
($k = 1$), we derive the rate of convergence of $\|\bar\beta - \hat\beta\|_2$. Refitted estimation using only the selected
covariates allows us to eliminate a $\log d$ term in the statistical rate of convergence of the estimator.
We present the high dimensional and low dimensional results separately, with the analysis of the
refitting procedures appearing as corollaries to the low dimensional analysis.

4.1 The High-Dimensional Linear Model 

Recall that the high dimensional DC estimator is $\bar\beta^d = k^{-1}\sum_{j=1}^{k}\hat\beta^d(D_j)$, where $\hat\beta^d(D_j)$ for $1 \le j \le k$
is the debiased estimator defined in (3.3). We also denote the debiased LASSO estimator
using the entire dataset as $\hat\beta^d = \hat\beta^d(\cup_{j=1}^{k} D_j)$. The following lemma shows that not only is $\bar\beta^d$
asymptotically normal, it approximates the full sample estimator $\hat\beta^d$ so well that it has the same
statistical error as $\hat\beta^d$ provided the number of subsamples $k$ is not too large.

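Lemma 4.1. Under Conditions 3.1 and 3.2, if $\lambda$, $\vartheta_1$ and $\vartheta_2$ are chosen as $\lambda \asymp \sqrt{k \log d/n}$,
$\vartheta_1 \asymp \sqrt{k \log d/n}$ and $\vartheta_2 n^{-1/2} = o(1)$, we have with probability $1 - c/d$,

$$\|\bar\beta^d - \hat\beta^d\|_\infty \le C\,\frac{sk\log d}{n} \quad \text{and} \quad \|\bar\beta^d - \beta^*\|_\infty \le C\Big(\sqrt{\frac{\log d}{n}} + \frac{sk\log d}{n}\Big). \qquad (4.1)$$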


Remark 4.2. The term $\sqrt{\log d/n}$ in (4.1) is the estimation error of $\|\hat\beta^d - \beta^*\|_\infty$. Lemma 4.1 does
not rely on any specific choice of $k$; however, in order for the aggregated estimator $\bar\beta^d$ to attain
the same $\|\cdot\|_\infty$ norm estimation error as the full sample LASSO estimator, $\hat\beta_{\mathrm{LASSO}}$, the required
scaling is $k = O(\sqrt{n/(s^2 \log d)})$. This is a weaker scaling requirement than that of Theorem 3.3
because the latter entails a guarantee of asymptotic normality, which is a stronger result. It is for
the same reason that our estimation results only require $O(\cdot)$ scaling whilst those for testing require
$o(\cdot)$ scaling.

Although $\bar\beta^d$ achieves the same rate as the LASSO estimator under the infinity norm, it cannot
achieve the minimax rate in $\ell_2$ norm since it is not a sparse estimator. To obtain an estimator
with the $\ell_2$ minimax rate, we sparsify $\bar\beta^d$ by hard thresholding. For any $\beta \in \mathbb{R}^d$, define the hard
thresholding operator $T_\nu$ such that the $j$-th entry of $T_\nu(\beta)$ is

$$[T_\nu(\beta)]_j = \beta_j \mathbf{1}\{|\beta_j| \ge \nu\}, \quad \text{for } 1 \le j \le d. \qquad (4.2)$$

According to (4.1), if $\beta^*_j = 0$, we have $|\bar\beta^d_j| \le C(\sqrt{\log d/n} + sk\log d/n)$ with high probability. The
following theorem characterizes the estimation rate of the thresholded estimator $T_\nu(\bar\beta^d)$.

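Theorem 4.3. Suppose Conditions 3.1 and 3.2 are fulfilled and choose $\lambda \asymp \sqrt{k \log d/n}$, $\vartheta_1 \asymp
\sqrt{k \log d/n}$ and $\vartheta_2 n^{-1/2} = o(1)$. Take the parameter of the hard threshold operator in (4.2) as
$\nu = C_0\sqrt{\log d/n}$ for some sufficiently large constant $C_0$. If the number of subsamples satisfies
$k = O(\sqrt{n/(s^2\log d)})$, for large enough $d$ and $n$, we have with probability $1 - c/d$,

$$\|T_\nu(\bar\beta^d) - T_\nu(\hat\beta^d)\|_2 \le C\,\frac{s^{3/2}k\log d}{n}, \quad \|T_\nu(\bar\beta^d) - \beta^*\|_\infty \le C\sqrt{\frac{\log d}{n}} \quad \text{and} \quad \|T_\nu(\bar\beta^d) - \beta^*\|_2 \le C\sqrt{\frac{s\log d}{n}}. \qquad (4.3)$$

Remark 4.4. In fact, in the proof of Theorem 4.3, we show that if the thresholding parameter $\nu$
satisfies $\nu \ge \|\bar\beta^d - \beta^*\|_\infty$, we have $\|T_\nu(\bar\beta^d) - \beta^*\|_2 \le 2\sqrt{s}\cdot\nu$; it is for this reason that we choose
$\nu \asymp \sqrt{\log d/n}$. Unfortunately, the constant is difficult to choose in practice. In the following
paragraphs we propose a practical method to select the tuning parameter $\nu$.

Let $(M^{(j)}X^{(j)T})_\ell$ denote the transposed $\ell$-th row of $M^{(j)}X^{(j)T}$. Inspection of the proof of Theorem
3.3 reveals that the leading term of $\sqrt{n}\,\|\bar\beta^d - \beta^*\|_\infty$ satisfies

$$T_0 = \max_{1\le \ell\le d}\Big|\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\frac{1}{\sqrt{n_k}}(M^{(j)}X^{(j)T})_\ell^T\,\varepsilon^{(j)}\Big|.$$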





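Chernozhukov et al. (2013) propose the Gaussian multiplier bootstrap to estimate the quantile of
$T_0$. Let $\{\varsigma_i\}_{i=1}^{n}$ be i.i.d. standard normal random variables independent of $\{(Y_i, X_i)\}_{i=1}^{n}$. Consider
the statistic

$$W_0 = \max_{1\le \ell\le d}\Big|\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\frac{1}{\sqrt{n_k}}(M^{(j)}X^{(j)T})_\ell^T\,(\hat\varepsilon^{(j)} \circ \varsigma^{(j)})\Big|,$$

where $\hat\varepsilon^{(j)} \in \mathbb{R}^{n_k}$ is an estimator of $\varepsilon^{(j)}$ such that for any $i \in \mathcal{I}_j$, $\hat\varepsilon_i = Y_i - X_i^T\hat\beta(D_j)$, and
$\varsigma^{(j)}$ is a subvector of $\{\varsigma_i\}_{i=1}^{n}$ with indices in $\mathcal{I}_j$. Recall that "$\circ$" denotes the Hadamard product. The
$\alpha$-quantile of $W_0$ conditioning on $\{Y_i, X_i\}_{i=1}^{n}$ is defined as $c_{W_0}(\alpha) = \inf\{t \mid P(W_0 \le t \mid Y, X) \ge \alpha\}$.
We can estimate $c_{W_0}(\alpha)$ by Monte Carlo and thus choose $\nu_0 = c_{W_0}(\alpha)/\sqrt{n}$. This choice ensures

$$\|T_{\nu_0}(\bar\beta^d) - \beta^*\|_2 = O_P(\sqrt{s\log d/n}),$$

which coincides with the $\ell_2$ convergence rate of the LASSO.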
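
The hard thresholding step in (4.2) is simple to express in code. The following is a minimal NumPy sketch, not the authors' implementation: the function name `hard_threshold`, the numerical values of the averaged estimate, and the constant $C_0 = 1$ are illustrative assumptions.

```python
import numpy as np

def hard_threshold(beta, nu):
    """Entrywise hard thresholding: keep beta_j when |beta_j| >= nu, otherwise set it to 0."""
    beta = np.asarray(beta, dtype=float)
    return np.where(np.abs(beta) >= nu, beta, 0.0)

# Made-up averaged estimate: two clear signals plus small noise entries.
beta_bar = np.array([2.1, -0.03, 0.0, -1.7, 0.02])
n, d = 1000, beta_bar.size

# Threshold of order sqrt(log d / n); the constant C_0 = 1 is an arbitrary choice.
nu = 1.0 * np.sqrt(np.log(d) / n)
beta_thr = hard_threshold(beta_bar, nu)
```

In this toy example the threshold is roughly 0.04, so only the two large entries survive and the result is sparse.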

Remark 4.5. Lemma 4.1 and Theorem 4.3 show that if the number of subsamples satisfies $k =
o(\sqrt{n/(s^2\log d)})$, then $\|\bar\beta^d - \hat\beta^d\|_\infty = o_P(\sqrt{\log d/n})$ and $\|T_\nu(\bar\beta^d) - T_\nu(\hat\beta^d)\|_2 = o_P(\sqrt{s\log d/n})$, and thus
the error incurred by the divide and conquer procedure is negligible compared to the statistical
minimax rate. The reason for this contraction phenomenon is that $\bar\beta^d$ and $\hat\beta^d$ share the same leading
term in their Taylor expansions around $\beta^*$. The difference between them is only the difference of two
remainder terms, which is of smaller order than the leading term. We uncover a similar phenomenon
in the low dimensional case covered in Section 4.3. However, in the low dimensional case $\ell_2$ norm
consistency is automatic while the high dimensional case requires an additional thresholding step
to guarantee sparsity and, consequently, $\ell_2$ norm consistency.


4.2 The High-Dimensional Generalized Linear Model 


We can generalize the DC estimation of the linear model to the GLM. Recall that $\hat\beta^d(D_j)$ is the
debiased estimator defined in (3.9) and the aggregated estimator is $\bar\beta^d = k^{-1}\sum_{j=1}^{k}\hat\beta^d(D_j)$. We still
denote $\hat\beta^d = \hat\beta^d(\cup_{j=1}^{k} D_j)$. The next lemma bounds the error incurred by splitting the sample and
the statistical rate of convergence of $\bar\beta^d$ in terms of the infinity norm.

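Lemma 4.6. Under Conditions 2.1, 3.6 and 3.7, for $\hat\beta^\lambda$ with $\lambda \asymp \sqrt{k\log d/n}$, we have with
probability $1 - c/d$,

$$\|\bar\beta^d - \hat\beta^d\|_\infty \le C\,\frac{(s \vee s_1)k\log d}{n} \quad \text{and} \quad \|\bar\beta^d - \beta^*\|_\infty \le C\Big(\sqrt{\frac{\log d}{n}} + \frac{(s \vee s_1)k\log d}{n}\Big). \qquad (4.4)$$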


Applying a similar thresholding step as in the linear model, we obtain the following estimation
rate in $\ell_2$ norm.

Theorem 4.7. Under Conditions 2.1 - 3.7, choose $\lambda \asymp \sqrt{k \log d/n}$ and $\lambda_v \asymp \sqrt{k \log d/n}$. Take the
parameter of the hard threshold operator in (4.2) as $\nu = C_0\sqrt{\log d/n}$ for some sufficiently large
constant $C_0$. If the number of subsamples satisfies $k = O(\sqrt{n/((s \vee s_1)^2\log d)})$, for large enough $d$
and $n$, we have with probability $1 - c/d$,

$$\|T_\nu(\bar\beta^d) - T_\nu(\hat\beta^d)\|_2 \le C\,\frac{(s \vee s_1)s^{1/2}k\log d}{n}, \quad \|T_\nu(\bar\beta^d) - \beta^*\|_\infty \le C\sqrt{\frac{\log d}{n}} \qquad (4.5)$$

and $\|T_\nu(\bar\beta^d) - \beta^*\|_2 \le C\sqrt{s\log d/n}$.

Remark 4.8. As in the case of the linear model, Theorem 4.7 reveals that the loss incurred by the 
divide and conquer procedure is negligible compared to the statistical minimax estimation error 
provided k = o(y^n/(si V s)2slog(i). 

A similar proof strategy to that of Theorem 4.7 allows us to construct an estimator of Q*^ 
that achieves the required consistency with the scaling of Corollary 3.9. Our estimator is := 
where 0 = 0(^) and is the thresholding operator defined in equation 

(4.2) with () = Cl Y^log d/n for some sufficiently large constant Ci. 

Corollary 4.9. Under the conditions and scaling of Theorem 3.8, |0^u — 0*^| = op(l). 

Substituting this estimator in Corollary 3.9 delivers a practically implementable test statistic
based on $k = o(((s \vee s_1)\log d)^{-1}\sqrt{n})$ subsamples.


4.3 The Low-Dimensional Linear Model 


As mentioned earlier, the infinity norm bound derived in Lemma 4.1 can be used to do model 
selection, after which the selected support can be shared across all the local agents, significantly 
reducing the dimension of the problem as we only need to refit the data on the selected model. The 
remaining challenge is to implement the divide and conquer strategy in the low dimensional setting, 
which is also of independent interest. Here we focus on the linear model, while the generalized linear 
model is covered in Section 4.4. 

In this section $d$ still stands for dimension, but in contrast with the rest of this paper in which
$d \gg n$, here we consider $d < n$. More specifically, we consider the linear model (3.2) with $d < n$ and
i.i.d. sub-Gaussian noise $\{\varepsilon_i\}_{i=1}^{n}$. It is well known that the ordinary least squares (OLS) estimator
of $\beta^*$ is defined as $\hat\beta = (X^TX)^{-1}X^TY$. In the massive data setting, the communication cost of
estimating and inverting covariance matrices is very high (order $O(kd^2)$). However, as pointed out
by Chen and Xie (2012), this estimator exactly coincides with the DC estimator

$$\hat\beta = \Big(\sum_{j=1}^{k}X^{(j)T}X^{(j)}\Big)^{-1}\sum_{j=1}^{k}X^{(j)T}Y^{(j)}.$$

In this section, we study the DC strategy to approximate $\hat\beta$ with communication cost only
$O(kd)$, which implies that we can only communicate $d$ dimensional vectors.
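
The exact coincidence noted by Chen and Xie (2012) can be checked numerically. The sketch below is an illustration on simulated data; the sample sizes, seed and variable names are assumptions, and each "machine" is simply a contiguous block of rows. Every machine sends a $d \times d$ Gram matrix and a $d$-vector, which is the $O(kd^2)$ communication cost mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 5, 6                      # illustrative sizes
X = rng.standard_normal((n, d))
beta_star = np.ones(d)
Y = X @ beta_star + rng.standard_normal(n)

# Full sample OLS estimator.
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)

# Aggregate the local sufficient statistics X_j^T X_j and X_j^T Y_j.
gram = np.zeros((d, d))
xty = np.zeros(d)
for Xj, Yj in zip(np.array_split(X, k), np.array_split(Y, k)):
    gram += Xj.T @ Xj
    xty += Xj.T @ Yj
beta_agg = np.linalg.solve(gram, xty)
```

Up to floating point error, `beta_agg` and `beta_full` agree, since the summed Gram matrices reproduce $X^TX$ and $X^TY$ exactly.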

The OLS estimator based on the subsample $D_j$ is defined as $\hat\beta(D_j) = (X^{(j)T}X^{(j)})^{-1}X^{(j)T}Y^{(j)}$.
In order to estimate $\beta^*$, a simple and natural idea is to take the average of $\{\hat\beta(D_j)\}_{j=1}^{k}$, which
we denote by $\bar\beta$. The question is whether this estimator preserves the statistical error of $\hat\beta$. The
following theorem gives an upper bound on the gap between $\bar\beta$ and $\hat\beta$, and shows that this gap is
negligible compared with the statistical error of $\hat\beta$ as long as $k$ is not large.

Theorem 4.10. Consider the linear model (3.2). Suppose Conditions 3.1 and 3.2 hold and $\{\varepsilon_i\}_{i=1}^{n}$
are i.i.d. sub-Gaussian random variables with $\|\varepsilon_i\|_{\psi_2} \le \sigma_1$. If the number of subsamples satisfies
$k = O(nd/(d \vee \log n)^2)$, then for sufficiently large $n$ and $d$ it follows that

$$\|\bar\beta - \hat\beta\|_2 = O_P\big(\sqrt{k}\,(d \vee \log n)/n\big), \quad \|\bar\beta - \beta^*\|_2 = O_P(\sqrt{d/n}). \qquad (4.6)$$

Remark 4.11. By taking $k = o(nd/(d \vee \log n)^2)$, the loss incurred by the divide and conquer
procedure, i.e., $\|\bar\beta - \hat\beta\|_2$, converges at a faster rate than the statistical error of the full sample
estimator $\hat\beta$.
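
To make the comparison concrete, the following sketch computes the averaged OLS estimator, which needs only $d$-dimensional vectors from each machine, together with the two quantities discussed above: the divide and conquer loss and the statistical error of the full sample estimator. It is a hypothetical illustration on simulated data; the helper names `ols` and `averaged_ols` and all sizes are assumptions.

```python
import numpy as np

def ols(X, Y):
    """Ordinary least squares via a numerically stable least-squares solve."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def averaged_ols(X, Y, k):
    """Split the rows into k blocks, fit OLS on each block, and average the k estimates."""
    blocks = zip(np.array_split(X, k), np.array_split(Y, k))
    return np.mean([ols(Xj, Yj) for Xj, Yj in blocks], axis=0)

rng = np.random.default_rng(1)
n, d, k = 4000, 10, 8                    # illustrative sizes
beta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
Y = X @ beta_star + rng.standard_normal(n)

beta_hat = ols(X, Y)                     # full sample estimator
beta_bar = averaged_ols(X, Y, k)         # divide and conquer estimator

dc_loss = np.linalg.norm(beta_bar - beta_hat)    # loss due to splitting
stat_err = np.linalg.norm(beta_hat - beta_star)  # statistical error
```

Repeating this over a grid of `k` and plotting `dc_loss / stat_err` is one way to explore the regime in which splitting is harmless.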

We now take a different viewpoint by returning to the high dimensional setting of Section 4.1
($d \gg n$) and applying Theorem 4.10 in the context of a refitting estimator. In this refitting setting,
the sparsity $s$ of Lemma 4.1 becomes the dimension of a low dimensional parameter estimation
problem on the selected support. Our refitting estimator is defined as

$$\bar\beta^r = \frac{1}{k}\sum_{j=1}^{k}\big(X^{(j)T}_{\hat S}X^{(j)}_{\hat S}\big)^{-1}X^{(j)T}_{\hat S}Y^{(j)}, \qquad (4.7)$$

where $\hat S := \{j : |\bar\beta^d_j| > 2C\sqrt{\log d/n}\}$ and $C$ is the same constant as in (4.1).

Corollary 4.12. Suppose $\beta^*_{\min} > 2C\sqrt{\log d/n}$, where $\beta^*_{\min} := \min_{1\le j\le d}|\beta^*_j|$ and $C$ is the same
constant as in (4.1). Define the full sample oracle estimator as $\hat\beta^o = (X_S^TX_S)^{-1}X_S^TY$, where $S$ is
the true support of $\beta^*$. If $k = O(\sqrt{n/(s^2\log d)})$, then for sufficiently large $n$ and $d$ we have

$$\|\bar\beta^r - \hat\beta^o\|_2 = O_P\big(\sqrt{k}\,(s \vee \log n)/n\big), \quad \|\bar\beta^r - \beta^*\|_2 = O_P(\sqrt{s/n}). \qquad (4.8)$$

We see from Corollary 4.12 that $\bar\beta^r$ achieves the oracle rate when the minimum signal strength
is not too weak and the number of subsamples $k$ is not too large.


4.4 The Low-Dimensional Generalized Linear Model 

The next theorem quantifies the gap between $\bar\beta$ and $\hat\beta$, where $\bar\beta$ is the average of subsampled GLM
estimators and $\hat\beta$ is the full sample GLM estimator.

Theorem 4.13. Under Condition 3.6, if $k = O(\sqrt{n}/(d \vee \log n))$, then we have for sufficiently large
$d$ and $n$,

$$\|\bar\beta - \hat\beta\|_2 = O_P\big(k\sqrt{d}\,(d \vee \log n)/n\big), \quad \|\bar\beta - \beta^*\|_2 = O_P(\sqrt{d/n}). \qquad (4.9)$$

Remark 4.14. In analogy to Theorem 4.10, by constraining the growth rate of the number of
subsamples according to $k = o(\sqrt{n}/(d \vee \log n))$, the error incurred by the divide and conquer
procedure, i.e., $\|\bar\beta - \hat\beta\|_2$, decays at a faster rate than that of the statistical error of the full sample
estimator $\hat\beta$.

As in the linear model, Lemma 4.6 together with Theorem 4.13 allow us to study the theoretical
properties of a refitting estimator for the high dimensional GLM. Estimation on the estimated
support set is again a low dimensional problem, thus the $d$ of Theorem 4.13 corresponds to the $s$
of Lemma 4.6 in this refitting setting. The refitted GLM estimator is defined as

$$\bar\beta^r = \frac{1}{k}\sum_{j=1}^{k}\hat\beta^r(D_j), \qquad (4.10)$$

where $\hat\beta^r(D_j) = \operatorname{argmin}_{\beta \in \mathbb{R}^d,\ \beta_{\hat S^c} = 0}\ \ell^{(j)}_{n_k}(\beta)$ and $\hat S := \{j : |\bar\beta^d_j| > 2C\sqrt{\log d/n}\}$. The following
corollary quantifies the statistical rate of $\bar\beta^r$.

Corollary 4.15. Suppose $\beta^*_{\min} > 2C\sqrt{\log d/n}$, where $\beta^*_{\min} := \min_{1\le j\le d}|\beta^*_j|$ and $C$ is the same
constant as in (4.4). Define the full sample oracle estimator as $\hat\beta^o = \operatorname{argmin}_{\beta \in \mathbb{R}^d,\ \beta_{S^c} = 0}\ \ell_n(\beta)$, where
$S$ is the true support of $\beta^*$. If $k = O(\sqrt{n/((s \vee s_1)^2\log d)})$, then for sufficiently large $n$ and $d$ we
have

$$\|\bar\beta^r - \hat\beta^o\|_2 = O_P\big(k\sqrt{s}\,(s \vee \log n)/n\big), \quad \|\bar\beta^r - \beta^*\|_2 = O_P(\sqrt{s/n}). \qquad (4.11)$$

We thus see that $\bar\beta^r$ achieves the oracle rate when the minimum signal strength is not too weak
and the number of subsamples $k$ is not too large.

5 Numerical Analysis 

In this section, we illustrate and validate our theoretical findings through simulations. For inference, 
we use QQ plots to compare the distribution of p-values for divide and conquer test statistics to 
their theoretical uniform distribution. We also investigate the estimated type I error and power of 
the divide and conquer tests. For estimation, we validate our claim of Sections 4.3 and 4.4 that 
the loss incurred by the divide and conquer strategy is negligible compared to the statistical error 
of the corresponding full sample estimator in the low dimensional case. An analogous empirical 
verification of the theory is performed for the high dimensional case, where we also compare the
performance of the divide and conquer thresholding estimator of Section 4.1 to the full sample 
LASSO and the average LASSO over subsamples. 

5.1 Results on Inference 

We explore the probability of rejection of a null hypothesis of the form $H_0 : \beta^*_1 = 0$ when data
are generated according to the linear model,

$$Y_i = X_i^T\beta^* + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_\varepsilon^2),$$

for $\sigma_\varepsilon = 1$ and $\beta^*$ an $s$ sparse $d$ dimensional vector with $d = 850$ and $s = 3$. In each Monte Carlo
replication, we split the initial sample of size $n$ into $k$ subsamples of size $n/k$. In particular we
choose $n = 840$ because it has a large number of factors $k \in \{1, 2, 5, 10, 15, 20, 24, 28, 30, 35, 40\}$.
The number of simulations is 250. Using $\hat\beta_{\mathrm{LASSO}}$ as a preliminary estimator of $\beta^*$, we construct
Wald and Rao score test statistics as described in Sections 3.1.2 and 3.2 respectively.

Panels (A) and (B) of Figure 1 are QQ plots of the p-values of the divide and conquer Wald
and score test statistics under the null hypothesis against the theoretical quantiles of the uniform
[0,1] distribution for four different values of $k$. For both test constructions, the distributions of the
p-values are close to uniform and remain so as we split the data set. When $k = 40$, the distribution
of the corresponding p-values is visibly non-uniform, as expected from the theory developed in
Sections 3.1.2 and 3.2. Panel (A) of Figure 2 also supports our theoretical findings, showing that,
for both test constructions, the empirical level of the test is close to both the nominal $\alpha = 0.05$
level and the level of the full sample oracle OLS estimator which knows the true support of $\beta^*$.
Moreover, it remains at this level as long as we do not split the data set too many times. Panel
(B) of Figure 2 displays the power of the test for two different signal strengths, $\beta^*_1 = 0.125$ and
$\beta^*_1 = 0.15$. We see that the power is also comparable with that of the full sample oracle OLS
estimator which knows the true support of $\beta^*$.

Figure 1: QQ plots of the p-values of the Wald (A) and score (B) divide and conquer test statistics
against the theoretical quantiles of the uniform [0,1] distribution under the null hypothesis.


5.2 Results on Estimation 

In this section, we turn our attention to experimental validation of our divide and conquer estimation
theory, focusing first on the low dimensional case and then on the high dimensional case.

5.2.1 The Low-Dimensional Linear Model 

All $n \times d$ entries of the design matrix $X$ are generated as i.i.d. standard normal random variables
and the errors are i.i.d. standard normal as well. The true regression vector $\beta^*$ satisfies
$\beta^*_j = 10/\sqrt{d}$ for $j = 1, \dots, d/2$ and $\beta^*_j = -10/\sqrt{d}$ for $j > d/2$, which guarantees that $\|\beta^*\|_2 = 10$.
Then we generate the response variable according to the model (3.2). Denote the full sample
ordinary least-squares estimator and the divide and conquer estimator by $\hat\beta$ and $\bar\beta$ respectively.
Figure 3(A) illustrates the change in the ratio $\|\bar\beta - \hat\beta\|_2/\|\hat\beta - \beta^*\|_2$ as the sample size increases,
where $k$ assumes three different growth rates and $d = \sqrt{n}/2$. Figure 3(B) focuses on the relationship
between the statistical error of $\bar\beta$ and $\log k$ under three different scalings of $n$ and $d$. All the data
points are obtained based on an average over 100 Monte Carlo replications.
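
The simulation design described above can be generated as follows. This is a minimal sketch under stated assumptions: the helper names `make_beta_star` and `simulate_linear`, the seed and the choice $n = 1600$ are illustrative, and $d$ is assumed even.

```python
import numpy as np

def make_beta_star(d, norm=10.0):
    """First d/2 entries equal +norm/sqrt(d), the remaining entries -norm/sqrt(d) (d assumed even)."""
    beta = np.full(d, norm / np.sqrt(d))
    beta[d // 2:] *= -1.0
    return beta

def simulate_linear(n, d, rng):
    """Draw a standard normal design and standard normal errors, then form the responses."""
    X = rng.standard_normal((n, d))
    beta_star = make_beta_star(d)
    Y = X @ beta_star + rng.standard_normal(n)
    return X, Y, beta_star

rng = np.random.default_rng(2)
n = 1600
d = int(np.sqrt(n) / 2)                  # the scaling d = sqrt(n)/2 gives d = 20 here
X, Y, beta_star = simulate_linear(n, d, rng)
```

By construction the Euclidean norm of `beta_star` equals 10 for any even `d`, matching the normalization stated in the text.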

Figure 2: (A) Estimated probabilities of type I error for the Wald and score tests as a function
of $k$. (B) Estimated power with signal strength 0.125 and 0.15 for the Wald and score tests as a
function of $k$.

Figure 3: (A) The ratio between the loss of the divide and conquer procedure and the statistical
error of the estimator based on the whole sample, $\|\bar\beta - \hat\beta\|_2/\|\hat\beta - \beta^*\|_2$, with $d = \sqrt{n}/2$ and different growth rates of $k$.
(B) Statistical error of the DC estimator, $\|\bar\beta - \beta^*\|_2/\|\beta^*\|_2$, against $\log k$.

As Figure 3(A) demonstrates, when k = 0(n^/^), or 0(1), the ratio decreases with ever 

faster rates, which is consistent with the argument of Remark 4.11 that the ratio goes to zero when
$k = o(n/d) = o(\sqrt{n})$. When $k = O(\sqrt{n})$, however, we observe that the ratio is essentially constant,
which suggests the rate we derived in Theorem 4.10 is sharp.

From Figure 3(B), we see that when $k$ is not large, the statistical error of $\bar\beta$ is very small because
the loss incurred by the divide and conquer procedure is negligible compared to the statistical error
of $\hat\beta$. However, when $k$ is larger than a threshold, there is a surge in the statistical error, since the
loss of the divide and conquer begins to dominate the statistical error of $\hat\beta$. We also notice that the
larger the ratio $n/d$, the larger the threshold of $\log k$, which is again consistent with Remark 4.11.

Figure 4: (A) The ratio between the loss of the divide and conquer procedure and the statistical
error of the estimator based on the whole sample when $d = 20$. (B) Statistical error of the DC
estimator.

5.2.2 The Low-Dimensional Logistic Regression

In logistic regression, given covariates $X$, the response $Y|X \sim \mathrm{Ber}(\eta(X))$, where $\mathrm{Ber}(\eta)$ denotes
the Bernoulli distribution with expectation $\eta$ and

$$\eta(X) = \frac{1}{1 + \exp(-X^T\beta^*)}.$$

We see that $\mathrm{Ber}(\eta(X))$ is in exponential dispersion family canonical form (2.7) with $b(\theta) = \log(1 +
e^{\theta})$, $\phi = 1$ and $c(y) = 1$. The use of the canonical link,

$$\eta = \frac{e^{\theta}}{1 + e^{\theta}},$$

leads to the simplification $\theta(X) = X^T\beta^*$.

In our Monte Carlo experiments, all $n \times d$ entries of the design matrix $X$ are generated as
i.i.d. standard normal random variables. The true regression vector $\beta^*$ satisfies $\beta^*_j = 1/\sqrt{d}$ for
$j \le d/2$ and $\beta^*_j = -1/\sqrt{d}$ for $j > d/2$, which guarantees that $\|\beta^*\|_2 = 1$. Finally, we generate the
response variables according to $\mathrm{Ber}(\eta(X))$. Figure 4(A) illustrates the change of the ratio
$\|\bar\beta - \hat\beta\|_2/\|\hat\beta - \beta^*\|_2$ as the sample size increases, where $k$ assumes three different growth rates and
$d = 20$. Figure 4(B) focuses on the relationship between the statistical error of $\bar\beta$ and $\log k$ under
three different scalings of $n$ and $d$. All the data points are obtained based on an average over 100
Monte Carlo replications.
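
A single replication of this logistic experiment can be sketched as below. This is an illustrative assumption-laden example rather than the code used for the figures: the unpenalized maximum likelihood estimator is computed by plain Newton-Raphson without an intercept, and the function names, seed, sample size and number of splits are all hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, n_iter=25):
    """Unpenalized logistic MLE by Newton-Raphson (no intercept), started at zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                          # score of the log-likelihood
        hess = (X * (p * (1.0 - p))[:, None]).T @ X   # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(3)
n, d, k = 8000, 20, 4                                 # illustrative sizes
beta_star = np.full(d, 1.0 / np.sqrt(d))
beta_star[d // 2:] *= -1.0                            # unit Euclidean norm by construction
X = rng.standard_normal((n, d))
y = rng.binomial(1, sigmoid(X @ beta_star)).astype(float)

beta_hat = logistic_mle(X, y)                         # full sample estimator
blocks = zip(np.array_split(X, k), np.array_split(y, k))
beta_bar = np.mean([logistic_mle(Xj, yj) for Xj, yj in blocks], axis=0)

ratio = np.linalg.norm(beta_bar - beta_hat) / np.linalg.norm(beta_hat - beta_star)
```

Averaging `ratio` over many replications and several growth rates of `k` would give curves of the kind summarized in Figure 4(A).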

Figure 4 reveals similar phenomena to those revealed in Figure 3 of the previous subsection. 
More specifically. Figure 4(A) shows that when k = 0(n^/^), 0(n^/^) or 0(1), the ratio decreases 
with even faster rates, which is consistent with the argument of Remark 4.14 that the ratio converges 
to zero when $k = o(\sqrt{n}/d) = o(\sqrt{n})$. When $k = O(\sqrt{n})$, however, we observe that the ratio remains
essentially constant when $\log n$ is large, which suggests the rate we derived in Theorem 4.13 is sharp.

As for Figure 4(B), we again observe that the statistical error of $\bar\beta$ is very small when $k$ is
sufficiently small, but grows fast when $k$ becomes large. The reasoning is the same as in the linear
model, i.e., when $k$ is large, the loss incurred by the divide and conquer procedure is non-negligible
as compared with the statistical error of $\hat\beta$. In addition, as Figure 4(B) reveals, the larger
$\sqrt{n}/d$ is, the larger the threshold of $k$, which is again consistent with the threshold rate pointed out
in Remark 4.14.

5.2.3 The High Dimensional Linear Model 

We now consider the same setting of Section 5.1 with $n = 1400$, $d = 1500$ and $\beta^*_j = 10$ for all
$j$ in the support of $\beta^*$. In this context, we analyze the performance of the thresholded averaged
debiased estimator of Section 4.1. Figure 5(A) depicts the average over 100 Monte Carlo replications
of $\|b - \beta^*\|_2$ for three different estimators: the debiased divide-and-conquer estimator $b = T_\nu(\bar\beta^d)$, the LASSO
estimator based on the whole sample, $b = \hat\beta_{\mathrm{LASSO}}$, and the estimator obtained by naively averaging
the LASSO estimators from the $k$ subsamples, $b = \bar\beta_{\mathrm{LASSO}}$. The parameter is taken as $\nu =
\sqrt{\log d/n}$ in the specification of $T_\nu(\bar\beta^d)$. As expected, the performance of $\bar\beta_{\mathrm{LASSO}}$ deteriorates
sharply as $k$ increases. $T_\nu(\bar\beta^d)$ outperforms $\hat\beta_{\mathrm{LASSO}}$ as long as $k$ is not too large. This is expected
because, for sufficiently large signal strength, both $\hat\beta_{\mathrm{LASSO}}$ and $T_\nu(\bar\beta^d)$ recover the correct support;
however, $T_\nu(\bar\beta^d)$ is unbiased for those $\beta^*_j$ in the support of $\beta^*$, whilst $\hat\beta_{\mathrm{LASSO}}$ is biased. Figure
5(B) shows the error incurred by the divide and conquer procedure, $\|T_\nu(\bar\beta^d) - T_\nu(\hat\beta^d)\|_2$, relative to
the statistical error of the full sample estimator, $\|T_\nu(\hat\beta^d) - \beta^*\|_2$, for four different scalings of $k$.
We observe that, with $k = O(\sqrt{n/\log d})$ and $n$ not too small, the relative error incurred by the
divide and conquer procedure is essentially constant across $n$, demonstrating the theory developed
in Theorem 4.3.


6 Discussion 

With the advent of the data revolution comes the need to modernize the classical statistical toolkit. 
For very large scale datasets, distribution of data across multiple machines is the only practical 
way to overcome storage and computational limitations. It is thus essential to build aggregation 
procedures for conducting inference based on the combined output of multiple machines. We 
successfully achieve this objective, deriving divide and conquer analogues of the Wald and score 
statistics and providing statistical guarantees on their performance as the number of sample splits 
grows to infinity with the full sample size. Tractable limit distributions of each DC test statistic 
are derived. These distributions are valid as long as the number of subsamples, k, does not grow 
too quickly. In particular, $k = o(((s \vee s_1)\log d)^{-1}\sqrt{n})$ is required in a general likelihood based
framework. If $k$ grows faster than $((s \vee s_1)\log d)^{-1}\sqrt{n}$, remainder terms become non-negligible and
contaminate the tractable limit distribution of the leading term. When attention is restricted to
the linear model, a faster growth rate of $k = o((s\log d)^{-1}\sqrt{n})$ is allowed.

Figure 5: (A) Statistical error of the DC estimator, split LASSO and the full sample LASSO for
$k \in \{1, 2, 5, 10, 20, 25, 35, 40, 50\}$ when $n = 1400$, $d = 1500$. (B) Euclidean norm difference between
the DC thresholded debiased estimator and its full sample analogue.

The divide and conquer strategy is also successfully applied to estimation of regression parameters.
We obtain the rate of the loss incurred by the divide and conquer strategy. Based on this
result, we derive an upper bound on the number of subsamples for preserving the statistical error.
For low-dimensional models, simple averaging is shown to be effective in preserving the statistical
error, so long as $k = O(n/d)$ for the linear model and $k = O(\sqrt{n}/d)$ for the generalized linear
model. For high-dimensional models, the debiased estimator used in the Wald construction is also
successfully employed, achieving the same statistical error as the LASSO based on the full sample,
so long as $k = O(\sqrt{n/\log d})$.

Our contribution advances the understanding of distributed inference and estimation in the 
presence of large scale and distributed data, but there is still a great deal of work to be done in the 
area. We focus here on the fundamentals of statistical inference and estimation in the divide and 
conquer setting. Beyond this, there is a whole toolkit of statistical methodology designed for the 
single sample setting, whose split sample asymptotic properties are yet to be understood. 

7 Proofs 

In this section, we present the proofs of the main theorems appearing in Sections 3.1-4. The 
statements and proofs of several auxiliary lemmas appear in the Supplementary Material. To 
simplify notation, we take = 0 without loss of generality. 

7.1 Proofs for Section 3.1 

The proof of Theorem 3.3 relies on the following lemma, which bounds the probability that the
optimization problems in (3.4) are feasible.

Lemma 7.1. Assume $\Sigma = E(X_iX_i^T)$ satisfies $C_{\min} < \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le C_{\max}$ as well as
$\|\Sigma^{-1/2}X_1\|_{\psi_2} = \kappa$, then we have

> 1 - 2 kd~^'^, where ci = ° -2. 

24e K Omax 

Proof. The proof is an application of the union bound in Lemma 6.2 of Javanmard and Montanari 
(2014). □ 


[ max - /||max < 


Using Lemma 7.1 we now prove Theorem 7.2, from which Theorem 3.3 easily follows. The
term $Z$ in the decomposition of $\sqrt{n}(\bar\beta^d - \beta^*)$ in Theorem 7.2 is responsible for the asymptotic
normality of the proposed DC Wald statistic in Theorem 3.3, while the upper bound on $k$ ensures
$\Delta$ is asymptotically negligible.

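Theorem 7.2. Suppose Conditions 3.1 and 3.2 are fulfilled. Let $\lambda \asymp \sqrt{k\log d/n}$ and $\vartheta_1 \asymp
\sqrt{k\log d/n}$. With $k = o((s\log d)^{-1}\sqrt{n})$, $\sqrt{n}(\bar\beta^d - \beta^*) = Z + \Delta$, where $Z = \frac{1}{\sqrt{k}}\sum_{j=1}^{k}\frac{1}{\sqrt{n_k}}M^{(j)}X^{(j)T}\varepsilon^{(j)}$
and $\|\Delta\|_\infty = o_P(1)$.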

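Proof. For notational convenience, we write $\hat\beta_{\mathrm{LASSO}}(D_j)$ simply as $\hat\beta^\lambda(D_j)$. Decompose $\bar\beta^d - \beta^*$ as

$$\bar\beta^d - \beta^* = \frac{1}{k}\sum_{j=1}^{k}\Big(\hat\beta^\lambda(D_j) - \beta^* + M^{(j)}\hat\Sigma^{(j)}\big(\beta^* - \hat\beta^\lambda(D_j)\big)\Big) + \frac{1}{k}\sum_{j=1}^{k}\frac{1}{n_k}M^{(j)}X^{(j)T}\varepsilon^{(j)}$$

$$= \frac{1}{k}\sum_{j=1}^{k}\big(I - M^{(j)}\hat\Sigma^{(j)}\big)\big(\hat\beta^\lambda(D_j) - \beta^*\big) + \frac{1}{k}\sum_{j=1}^{k}\frac{1}{n_k}M^{(j)}X^{(j)T}\varepsilon^{(j)},$$

hence $\sqrt{n}(\bar\beta^d - \beta^*) = Z + \Delta$, where

$$Z = \frac{1}{\sqrt{k}}\sum_{j=1}^{k}\frac{1}{\sqrt{n_k}}M^{(j)}X^{(j)T}\varepsilon^{(j)} \quad \text{and} \quad \Delta = \frac{\sqrt{n}}{k}\sum_{j=1}^{k}\big(I - M^{(j)}\hat\Sigma^{(j)}\big)\big(\hat\beta^\lambda(D_j) - \beta^*\big).$$

Defining $\Delta^{(j)} = (I - M^{(j)}\hat\Sigma^{(j)})(\hat\beta^\lambda(D_j) - \beta^*)$, we have

$$\|\Delta^{(j)}\|_\infty \le \|M^{(j)}\hat\Sigma^{(j)} - I\|_{\max}\,\|\hat\beta^\lambda(D_j) - \beta^*\|_1$$

by Hölder's inequality, where $\|I - M^{(j)}\hat\Sigma^{(j)}\|_{\max} \le \vartheta_1$ by the definition of $M^{(j)}$ and, for $\lambda =
C\sigma^2\sqrt{\log d/n_k}$,

$$P\Big(\|\hat\beta^\lambda(D_j) - \beta^*\|_1^2 > C\,\frac{s^2\log(2d)}{n_k} + t\Big) \le \exp\Big(-\frac{c\,n_k t}{s^2\sigma^2}\Big) \qquad (7.1)$$

by Bühlmann and van de Geer (2011). We thus bound the expectation of the $\ell_1$ loss by

$$E\big[\|\hat\beta^\lambda(D_j) - \beta^*\|_1^2\big] \le \frac{2Cs^2\log(2d)}{n_k} + \int_0^\infty \exp\Big(-\frac{c\,n_k t}{s^2\sigma^2}\Big)\,dt \le \frac{2Cs^2\log(2d)}{n_k} + \frac{s^2\sigma^2}{c\,n_k}. \qquad (7.2)$$

Define the event $\mathcal{E}^{(j)} := \{\|\hat\beta^\lambda(D_j) - \beta^*\|_1 \le s\sqrt{C\log(2d)/n_k}\}$ for $j = 1, \dots, k$. Then $\|\Delta^{(j)}\|_\infty \le
\Delta_1^{(j)} + \Delta_2^{(j)} + \Delta_3^{(j)}$, where

$$\Delta_1^{(j)} = \|M^{(j)}\hat\Sigma^{(j)} - I\|_{\max}\|\hat\beta^\lambda(D_j) - \beta^*\|_1\mathbf{1}\{\mathcal{E}^{(j)}\} - E\big[\|M^{(j)}\hat\Sigma^{(j)} - I\|_{\max}\|\hat\beta^\lambda(D_j) - \beta^*\|_1\mathbf{1}\{\mathcal{E}^{(j)}\}\big],$$

$$\Delta_2^{(j)} = \|M^{(j)}\hat\Sigma^{(j)} - I\|_{\max}\|\hat\beta^\lambda(D_j) - \beta^*\|_1\mathbf{1}\{\mathcal{E}^{(j)c}\} - E\big[\|M^{(j)}\hat\Sigma^{(j)} - I\|_{\max}\|\hat\beta^\lambda(D_j) - \beta^*\|_1\mathbf{1}\{\mathcal{E}^{(j)c}\}\big] \quad \text{and}$$

$$\Delta_3^{(j)} = E\big[\|M^{(j)}\hat\Sigma^{(j)} - I\|_{\max}\|\hat\beta^\lambda(D_j) - \beta^*\|_1\big].$$

Consider $\Delta_1^{(j)}$, $\Delta_2^{(j)}$ and $\Delta_3^{(j)}$ in turn. By Hoeffding's inequality, we have for any $t > 0$,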


aP > t I < exp 


i=i 

By Markov’s inequality, 


nfcfct^ \ 
Cs‘^'d‘1 log(2(i) / 


< exp — 


riknV 


Cs^ log^(2d) 


(7.3) 


1 




EhiE[A“l 


i=i 


< 2t-iE[||M(^')s(^') - /|UaxP^(7^i) - P*\\i 
< 2t-^i?iY^E[||n(^^i) - /31|?]E(f(n 


< CJ-I < Ct-hn7^/Hogd, 

\ nk Uk 


(7.4) 


where the penultimate inequality follows from Jensen’s inequality. Finally, by Jensen’s inequality 
again, 

1 ^ 

-J^A^ = E[||MWE(^')-/|Uax||3^(Pi)-nii] 


k 


i=i 


< ^i\E 




< C 


s log d 
nk 


(7.5) 


Combining (7.3), (7.4) and (7.5), 


I A||oo > 2 >C^/n ■ 


u=i y j=i 


slogd 


nk 


< exp{—ckn) + d 0, 

and taking $k = o((s\log d)^{-1}\sqrt{n})$ delivers $\|\Delta\|_\infty = o_P(1)$.


(7.6) 

□ 


Proof of Theorem 3.3. We verify the requirements of the Lindeberg-Feller central limit theorem 
(e.g. Kallenberg, 1997, Theorem 4.12). Write 


1 ^ y{j) ^ 

where 

^ i=i ^ i=i 


ij)T'vU)Aj) 

0 ) ._ Sj 




nm 
By the fact that $\varepsilon_i$ is independent of $X$ for all $i$ and $E[\varepsilon_i] = 0$,


E(c£V) = E 


iO-)rg(i)mW)-^/2mO-)rj5^p')E(g(i)) ^ g. 

By independence of and the definition of we also have 


= nm. 


Var V 


j=l i&Ij 

= k~^ ^ ^ ^ Var |X) 

j=l i&ij 

It only remains to verify the Lindeberg condition, i.e., 


= a 


iiSo„;T=o^EE®[«“)'l{l«“l>e<’}|vJ =0, Ve>0, 

i=i ieXj 


(7.7) 


By Lemma A.l, < n x\^\\e^p\ < n where liminf^^^ = Coo > 

0, hence the event > etr} is contained in the event > e<Tc„j,'i ?2 ^\/n} have 




(j)\2i,/i (i)i 


cr^ 


j=i iaXj j=i iaXj 


k 


i=i 


nk 


ieXi 


1 


= 


(e?Vl{kP^I > eacnJ2^V^^} 


Let 5 = eacnf,'d 2 Then, for any r] > 0, 


>^} <s-nH^>r]. (7.8) 


-(i)\2Li 


U)\v 


(i)| 


-(i)|2+??i 


6^ 


Since $\vartheta_2 n^{-1/2} = o(1)$ by the statement of the theorem, the choice $\eta = 2$ delivers


J'-SLEE® I > 


a k^oo rife—>-oo ^ 

j=l i€X^ 


< lim lim k ^i? 2 C„fe ^<7 ((e- ) ) =0 

“ fc^oonfe^oo ’"'= V ^ ' J 


(7.9) 


by the bounded fourth moment assumption. By the law of iterated expectations, all conditional
results hold in unconditional form as well. Hence, $V_n \rightsquigarrow N(0, \sigma^2)$ by the Lindeberg-Feller central
limit theorem. □

Proof of Corollary 3.5. Similar to (7.9), we also have


^ lim 


lim 




CJ"' 


k —^OO 71^ ^CXD 

j=i ieXj 


E 




O') I 


= 0 . 


The proof is complete through an application of the self-normalized Berry-Esseen inequality (de la
Peña et al., 2009), noting that $S_n = V_n + o_P(1)$, as demonstrated in the previous proof. □


Proof of Lemma 3.4. We first show that, for any j G {1,..., k}, \a‘^{'Dj) — cr^| = op(A: ^). To this 
end, letting 


Ei = - p*), 


we write 





1 

nk 




< A^^'> + + A^^\ 


,(i) ._ 


,(i) ._ 


1 


(i) ._ USA 


1 


(§\V,)-I 3 ‘){-Y.Y^Y)\, Old 


nk ^ 


iex, 


\ fik 


\\x(^^P\Vj)-P*)\\l/nk = OAXh) 


i^Tj 
|2 


by Theorem 6.1 of Bühlmann and van de Geer (2011). Hence, with $\lambda = C\sigma^2\sqrt{k\log d/n}$, $\Delta_1^{(j)} =
o_P(1)$ for $k = o((s\log d)^{-1}n)$, a fortiori for $k = o((s\log d)^{-1}\sqrt{n})$. Letting


A® = \\p\Vj)-px PYYY'-nY'e 


(i)di) 




nk 


ieii 


Ag = \\p\vp-px\Hx^'^e?]\L- 


We obtain the bound 


^(i) _ 

^2 — 


{p\Vj) -P*){ - nx^eX) + - r )E[xWeW] 

•^k 


ieXj 


< +Ag. 


By the statement of the Lemma, = E[Xp^E[ep^|Xp^]] = 0, hence A 22 = 0, while by 

the central limit theorem and Theorem 6.1 of Biihlmann and van de Geer (2011), 

A^i < 0^{Xs)Ow{nlX- 

We conclude A^'^^ = Op(Asn^^^^), and with A x cr'^y/k log d/n, A^'^^ = o(l) with k = o(n(s log d)”^/^), 
a fortiori for k = o[^/n{slogd)~^'j. Finally, noting that cr^ = E[ep^], A^-^^ = Op(n^^^^) = op(l) 
by the central limit theorem. Combining the bounds, we obtain |u^(Pj) — (T^| = op(l) for any 
j G {1,..., A:} and therefore |cj^ — ct^| < k~^ l^^(^i) “ ^^1 = op(l). □ 
The proofs of Theorem 3.8 and Corollary 3.9 are stated as an application of Lemmas A.7 and
A.8, which apply under a more general set of requirements.

Proof of Theorem 3.8. We verify (A1)-(A4) of Lemma A.7. For (A1), decompose the object of
interest as

— - 0*)||n,ax + — ||X(^')0* |Uax = Ai + As, 

Uk rik ^ ' 

where Ai can be further decomposed and bounded by 


nk 


J_||X(J)(0O) -0*)|| 


1 


Xp)^( 0 (i)_ 0 *)| 


= — max max 
< — max ||XJ|oo max — 0 *||i 

'klk l<v<d 


We have 


P(Ai > ,/2) < P( - ©Jill > ^1) < V- 


and by Condition 3.7, = o{d~^) = o{k~^) for any q > 2CMsi{ky/log d, a fortiori for q a con¬ 
stant. Since Xi is sub-Gaussian, a matching probability bound can easily be obtained for As, thus 
we obtain II < lif ior f) = o{k~^). (A2) and (A3) of Lemma A.7 are applications 

of Lemmas A.3 and A.4 respectively. To establish (A4), observe that in} [P^{Vj))—e^) = 

Ai + As+As, where Al = (©^/^-G;)'^ , As = ©f (V^ 4^ 4^(/3*)) 

and A 3 = ©*^ t'n2(/9*) — e^. We thus consider \ Ai{(3^{Vj) — /3*)| for i = 1,2,3. 

|As(^^(P,) -(3*)\ = - ^ @fX,Xf {P\V,) - 13*) [b"{Xj^\V,)) - b"{Xjl3*\ 


nk 


i&Xi 


< t /3 max|©fxJ —||A(3^(P^)-/3 

l<i<n' Uk ^ 


P^||X(/3'’'(T’j) —/3*) II 2 > n ^sA:log(d/(5)^ < 5 by Lemma A.4, thus P^|A 3 (/ 3 '’^(Pj) —/3*)| > <S 
for t X MU'in~^sk\og{d/5). Invoking Holder’s inequality, Hoeffding’s inequality and Condition 2.1, 
we also obtain, for t x n~^sklog{d/6), 

aS\V,) -(3*)\>t)< P( ©f (- V b"{Xf(3*)X,Xf) - e, \\^\V,) -p*\\ >t). 

/ \ \ Tib / max / 




Therefore 


^ I As (Pj) — (3*)\ > < 25. Finally, with t ^(s V si)A: log(d/5), 

(|Ai(3"(P,)-r)| >i) <ip( ^x(^\^\Vj)-P*) >t), 

\ / \ 'Ik " 'Ik. " ' 

i&Xi 


hence P^| Ai(/3'''(I?j) — (3*)\ > < 25. This follows because P( (^(3^{Vj) — (3*) 

y/sk \og{d/5) < 5 by Lemma A.4 and 

^ xf ( 0 , - ®,)h''{Xfp\Vj)) 


> 

rsj 


ieXi 


< 5 
by Lemma C.4 of Ning and Liu (2014).


□ 


Proof of Corollary 3.9. We verify (A5)-(A8) of Lemma A.8. (A5) is satisfied because the estimator of $\Theta^*_{vv}$ is consistent under the required scaling by the statement of the corollary. (A6) is satisfied by Condition
3.7. To verify (A7), first note that $\nabla\ell_i(\beta^*) = (b'(X_i^T\beta^*) - Y_i)X_i$. According to Lemma A.2, we
know that conditional on $X$, $b'(X_i^T\beta^*) - Y_i$ is a sub-gaussian random variable. Therefore Lemma
B.5 delivers
•* 111;) E E vf.(/3-)IU 

V i=i / 


which implies that with probability 1 — c/d, 

k 


EE V^i(/3*)||oo = Cy/nlogd 

j=l i£Xj 


(7.10) 


It only remains to verify (A8). Let V /y^nO*/. By the definition of the log 

likelihood, 

ek“i== 0 


(n 0 *^)V 2 


and by independence of {(L), 

k k 


Z] ^ Z Z Var(eJ^) = Z Z 

j=l i^Xj j=l i&j j=l i^Xj 

^ n 1 ^ 


= 1 . 


re 


2=1 


2 = 1 


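By Condition 3.6, $\Theta_{\min} > 0$, the event $\{|\xi_{iv}^{(j)}| > \epsilon\}$ coincides with the event $\{|\Theta_v^{*T}\nabla\ell_i(\beta^*)| > \epsilon\sqrt{\Theta_{\min}n}\} = \{|\Theta_v^{*T}X_i(Y_i - b'(X_i^T\beta^*))| > \epsilon\sqrt{\Theta_{\min}n}\}$. Furthermore, since $|\Theta_v^{*T}X_i| \le M$ by Condition 3.7, this event is contained in the event $\{|Y_i - b'(X_i^T\beta^*)| > \delta\}$, where $\delta = \epsilon\sqrt{\Theta_{\min}n}/M$. By
an analogous calculation to that of equation (7.8), we have

$$\mathbb{E}\big[(Y_i - b'(X_i^T\beta^*))^2\,\mathbb{1}\{|Y_i - b'(X_i^T\beta^*)| > \delta\}\,\big|\,X\big] \le \delta^{-\eta}\,\mathbb{E}\big[|Y_i - b'(X_i^T\beta^*)|^{2+\eta}\,\big|\,X\big].$$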

Hence, setting $\eta = 2$ and noting that $\mathbb{E}[(Y_i - b'(X_i^T\beta^*))^{2+\eta}\,|\,X] \le C\sqrt{2+\eta}\,\phi U_2$ by Lemma A.2, it
follows that


lim lim Z Z ® i'*)^ ^ "L1 1 

k^ootik^oo^ ^ 

> ^}] 


j=i ieXj 



< (^min) 

1 “^ lim lim 

k —^CXD ^oo 

""‘EEe? 

E[A:*Xf]0:(5' 


j=l i£lj 


< (^min) 

1 “^ lim lim 

fc^oo rik^oo 

M^Si/(ree^6»min) 

) = o. 


-2 


(7.11) 
where the last inequality follows because ||S||max = Umax < by Condition 3.6. Similarly, we have for any $\epsilon > 0$,


e ^ lim lim 

fc—>-oo nfe^oo ■ 




(j)i 


j=i ieXj 


Applying the self-normalized Berry-Esseen inequality, we complete the proof of this corollary. □


7.2 Proofs for Section 3.2 

The proof of Theorem 3.11 relies on several preliminary lemmas, collected in the Supplementary
Material. Without loss of generality we set $H_0 : \beta^*_v = 0$ to ease notation.

Proof of Theorem 3.11. Since S(0) = k~^ (B1)-(B4) of Condition A.9 

are fulfilled under Conditions 3.6 and 2.1 by Lemma A.10 (see Appendix A). The proof is now
simply an application of Lemma A.13 with $\beta^*_v = 0$ under the restriction of the null hypothesis. □

Proof of Lemma 3.14. The proof is an application of Lemma A.16, noting that (B1)-(B5) of Condition A.9 are fulfilled under Conditions 3.6 and 2.1 by Lemmas A.10 and A.11. □


7.3 Proofs for Section 4 


Recall from Section 2 that for an arbitrary matrix $M$, $M_\ell$ denotes the transposed $\ell$th row of $M$
and $[M]_\ell$ denotes the $\ell$th column of $M$.


Proof of Lemma 4-L According to Theorem 7.2, we have — (3*) = Z + A., where Z = 

(’^•6), we prove that ||A||oo/v/u < Csklogd/n with probability 

larger than 1 — exp(—c/cn) — > 1 — ci/d for some constant ci. Since f3^ is a special case of 

(3 when /c = 1, we also have y/n{P^ — (3*) = Z Ai, where (7.6) gives ||A||oo/\/^ < Cs\ogd/n. 
Therefore, we have \\(3 — (3'^\\oo < Csklogd/n with high probability. 

It only remains to bound the rate of ||Z’||oo/\/^- By Condition 3.2, conditioning on 
we have for any ^ = 1 ,..., d. 


Z£\l^ > 1 1 {Xar=i) = P(|^ > 1 1 {XJti) < 2 exp ( - 


1=1 

where $\kappa$ is the variance proxy of $\varepsilon$ defined in Condition 3.2 and

Q,= Wl 

1=1 

Let Qmax = iiiaxi<£<d Qf- Applying the union bound to (7.12), we have 
\Z'^QQiy/ri 1> t 


cnt 

k?Qi 


{X,}U) < P( max^ > 1 1 {Xi}U) 

d 

< J2^(\Z£\/Vn > t < 2 dexp 


cnC 


e=i 


KfQr. 


(7.12) 
Let $t = \sqrt{2\kappa^2 Q_{\max}\log d/(cn)}$; then with conditional probability $1 - 2/d$,

$$\|Z\|_\infty/\sqrt{n} \le \sqrt{2\kappa^2 Q_{\max}\log d/(cn)}. \tag{7.13}$$

The last step is to bound $Q_{\max}$. By definition, we have


-i k 1^1 

j E - E (vr[s!i»)= = - E(^f 


i=i 


i=i 


/C ^ Uh „ 
j=i ieD- 


Z =1 


(7.14) 

where 14 = E“^. The inequality is due to the fact that is the minimizer in (3.4). By condi¬ 
tion (3.2) and the connection between subgaussian and subexponential distributions, the random 
variable satisfies 

<?>! 

Therefore, by Bernstein’s inequality for subexponential random variables, we have 




2=1 


> t] < 2 exp — c 




;)A(; 


nt 




Applying the union bound again, we have 


max 

KKd 


> 8 k 


2=1 


<Y,^{\^Y.^xTmif-E[Xf[ft]ef\>8^nuJ^) <2/d. 


logd 

cn 

logd^ 

cn 


j=i i=i 

Therefore, with probability $1 - 2/d$, there exists a constant $C_1$ such that

1 


Qmax = max Qi < max —y^(X/’fi£)^ — KlxTflA'^ +^[XT -h flii < Ci, 

l<£<d l<£<d\n‘^ L 1 J L 1 J \1 nm 


2 = 1 


logd 


cn 


where the last inequality is due to Condition 3.1. By (7.13), we have with probability $1 - 4/d$,
$\|Z\|_\infty/\sqrt{n} \le \sqrt{2\kappa^2 C_1\log d/(cn)}$. Combining this with the result on $\|\Delta\|_\infty$ delivers the rate in the
lemma. □

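Proof of Theorem 4.3. By Lemma 4.1 and $k = O(\sqrt{n/(s^2\log d)})$, there exists a sufficiently large
$C_0$ such that for the event $\mathcal{E} := \{\|\bar\beta^d - \beta^*\|_\infty \le C_0\sqrt{\log d/n}\}$, we have $P(\mathcal{E}) \ge 1 - c/d$. We choose
$\nu = C_0\sqrt{\log d/n}$, which implies that, under $\mathcal{E}$, we have $\nu \ge \|\bar\beta^d - \beta^*\|_\infty$.

Let $S$ be the support of $\beta^*$. The derivations in the remainder of the proof hold on the event $\mathcal{E}$.
Observe $\mathcal{T}_\nu(\bar\beta^d_{S^c}) = 0$ as $\|\bar\beta^d_{S^c}\|_\infty \le \nu$. For $j \in S$, if $|\beta^*_j| \ge 2\nu$, we have $|\bar\beta^d_j| \ge |\beta^*_j| - \nu \ge \nu$ and thus
$|\mathcal{T}_\nu(\bar\beta^d_j) - \beta^*_j| = |\bar\beta^d_j - \beta^*_j| \le \nu$. While if $|\beta^*_j| < 2\nu$, $|\mathcal{T}_\nu(\bar\beta^d_j) - \beta^*_j| \le |\beta^*_j| \vee |\bar\beta^d_j - \beta^*_j| \le 2\nu$. Therefore,
on the event $\mathcal{E}$,

$$\|\mathcal{T}_\nu(\bar\beta^d) - \beta^*\|_2 = \|\mathcal{T}_\nu(\bar\beta^d_S) - \beta^*_S\|_2 \le 2\sqrt{s}\,\nu \quad\text{and}\quad \|\mathcal{T}_\nu(\bar\beta^d) - \beta^*\|_\infty = \|\mathcal{T}_\nu(\bar\beta^d_S) - \beta^*_S\|_\infty \le 2\nu.$$

The statement of the theorem follows because $\nu = C_0\sqrt{\log d/n}$ and $P(\mathcal{E}) \ge 1 - c/d$. Following the
same reasoning, on the event $\mathcal{E}' := \mathcal{E} \cap \{\|\hat\beta^d - \beta^*\|_\infty \le C_0\sqrt{\log d/n}\} \cap \{\|\bar\beta^d - \hat\beta^d\|_\infty \le C_0 sk\log d/n\}$,
we have

$$\|\mathcal{T}_\nu(\bar\beta^d) - \mathcal{T}_\nu(\hat\beta^d)\|_2 \le \sqrt{s}\,\|\mathcal{T}_\nu(\bar\beta^d) - \mathcal{T}_\nu(\hat\beta^d)\|_\infty \le Cs^{3/2}k\log d/n.$$

As Lemma 4.1 also gives $P(\mathcal{E}') \ge 1 - c/d$, the proof is complete. □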
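The case analysis in the proof of Theorem 4.3 can be traced on a small example. The sketch below assumes the hard-thresholding convention $\mathcal{T}_\nu(\beta)_j = \beta_j\mathbb{1}\{|\beta_j| > \nu\}$. The vectors `beta_star` and `beta_bar` and the value of `nu` are hypothetical numbers chosen so that the sup-norm event used in the proof holds.

```python
import math


def hard_threshold(beta, nu):
    """Keep coordinates with |beta_j| > nu and set the rest to zero."""
    return [b if abs(b) > nu else 0.0 for b in beta]


beta_star = [3.0, -2.5, 0.0, 0.0, 0.15]      # hypothetical sparse truth, s = 3
beta_bar = [3.08, -2.41, 0.05, -0.07, 0.09]  # hypothetical averaged estimate
nu = 0.1

# The event used in the proof: sup-norm estimation error at most nu.
on_event = max(abs(b - t) for b, t in zip(beta_bar, beta_star)) <= nu

thresholded = hard_threshold(beta_bar, nu)
s = sum(1 for t in beta_star if t != 0.0)
sup_error = max(abs(b - t) for b, t in zip(thresholded, beta_star))
l2_error = math.sqrt(sum((b - t) ** 2 for b, t in zip(thresholded, beta_star)))

print(on_event, thresholded)
print(sup_error <= 2 * nu, l2_error <= 2 * math.sqrt(s) * nu)
```

Both branches of the argument appear here: the two large coefficients survive thresholding with error at most $\nu$, while the small coefficient $0.15 < 2\nu$ is set to zero and contributes an error below $2\nu$.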


Proof of Lemma 4.6. The strategy of proving this lemma is similar to the proof of Lemma 4.1. In
the proof of Lemma A.7 and Theorem 3.8, we have shown that 


A-'»•) = - J E e“”'^ +1 ^ 

i=i i=i 


where the remainder term for each j is 


J _ 0 O)T A ^ - p 


nk 


ieXi 


and $\eta_i = tX_i^T\beta^* + (1 - t)X_i^T\hat\beta^\lambda(\mathcal{D}_j)$ for some $t \in (0,1)$. We bound $\Delta_j$ by decomposing it into
three terms:


IA j 11 00 E 


+ 


(/ - 0 *- ^ h''{xjp*)x,xj){p\vp - p* 


nk " 


iex,- 


h 


+ 


0 * J_ Y,{b"{Xlp\vp) - y'{Xlp*))X,xA {P\vp - p*) 

■v* 

h 

( 0 (i) _ 0 *)T^ ^ b"{Xjp\vp)X,xA{p\vp - / 3 = 


nk ^ 


iex, 


I 3 


By Hoeffding’s inequality and Condition 3.3, the first term is bounded by 

|/il < J _ 0 * J_ V h"{xfp*)x,xl p\vp - p 


nk " 




^ ^ gfclogd 
1 “ n ’ 


(7.15) 


with probability $1 - c/d$. By Condition 3.6 (iii), Condition 3.7 (iv) and Lemma A.4, we have with
probability $1 - c/d$,


I/ 2 I < max ||©*X,||^— V \J^\XS^{VP - P*)f < C 


2 ^ ^sklogd 


n 


(7.16) 
Finally, we bound $I_3$. With probability $1 - c/d$,


I/ 3 I < [U2-Y.b''{Xj(3\V,))[Xj{Q^^'^-Q*)f] i Y.[X,{l3\Vj) - 

/ \Ufc 


ieXj 


i&Ii 


< c 


(si V s)klogd 


n 


(7.17) 


where the last inequality is due to Lemma A.4 and Lemma C.4 of Ning and Liu (2014). 
Combining (7.15) - (7.17) and applying the union bound, we have 


1 

k 



< max II A,- Iloo = Op 

00 j 


(si V s)A:logd\ 


n 


Therefore, we only need to bound the infinity norm of the leading term $T$. By Condition 3.7 and
equation (7.10), we have with probability $1 - c/d$,


max max ||0l')-0:ili<Csiv/log^and ||-Vv4^^(r)|| <Cy/\ogd/n. (7.18) 

l<j<kl<v<d k 


i=i 


This, together with Condition 3.6 and Condition 3.7, gives the bound

1 ^ 

||T||oo< (Mmax||©W-0:||i + max||Xf0*||oo) 




< C 


i=i 


logd Si log d 
n n 


with probability $1 - c/d$. Since $\hat\beta^d$ is a special case of $\bar\beta^d$ when $k = 1$, the proof of the lemma is
complete. □

Proof of Corollary /.9. By an analogous proof strategy to that of Theorem 4.7, | [7((0)],;.i; — = 

n ^ log d) = op(l) under the conditions of the Corollary provided k = o(((s V si) log d) ^^/n) ■ 

□ 


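Proof of Theorem 4.10.

$$\begin{aligned}
\bar\beta - \hat\beta &= \frac{1}{k}\sum_{j=1}^{k}(X^{(j)T}X^{(j)})^{-1}X^{(j)T}Y^{(j)} - (X^TX)^{-1}X^TY \\
&= \frac{1}{k}\sum_{j=1}^{k}(X^{(j)T}X^{(j)}/n_k)^{-1}X^{(j)T}\varepsilon^{(j)}/n_k - (X^TX/n)^{-1}X^T\varepsilon/n \\
&= \frac{1}{k}\sum_{j=1}^{k}\big[(X^{(j)T}X^{(j)}/n_k)^{-1} - \Sigma^{-1}\big]X^{(j)T}\varepsilon^{(j)}/n_k + \big(\Sigma^{-1} - (X^TX/n)^{-1}\big)X^T\varepsilon/n.
\end{aligned} \tag{7.19}$$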

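For simplicity, denote $X^{(j)T}X^{(j)}/n_k$ by $S_X^{(j)}$, $X^TX/n$ by $S_X$, $(S_X^{(j)})^{-1} - \Sigma^{-1}$ by $D_1^{(j)}$ and $\Sigma^{-1} - S_X^{-1}$ by $D_2$. For any $\tau \in \mathbb{R}$, define an event $\mathcal{E}^{(j)} = \{\|(S_X^{(j)})^{-1}\|_2 \le 2/C_{\min}\} \cap \{\|S_X^{(j)} - \Sigma\|_2 \le (\delta_1 \vee \delta_1^2)\}$ for all $j = 1,\dots,k$, where $\delta_1 = C_1\sqrt{d/n_k} + \tau/\sqrt{n_k}$, and an event $\mathcal{E} = \{\|S_X^{-1}\|_2 \le 2/C_{\min}\} \cap \{\|S_X - \Sigma\|_2 \le (\delta_2 \vee \delta_2^2)\}$, where $\delta_2 = C_1\sqrt{d/n} + \tau/\sqrt{n}$. Note that by Lemmas B.1 and
B.4, the probabilities of both $(\mathcal{E}^{(j)})^c$ and $\mathcal{E}^c$ are very small. In particular

$$P(\mathcal{E}^c) \le \exp(-cn) + \exp(-c_1\tau^2) \quad\text{and}\quad P((\mathcal{E}^{(j)})^c) \le \exp(-cn/k) + \exp(-c_1\tau^2).$$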

k 

Then, letting Tq := n an application of the union bound and Lemma B.8 delivers 
i=i 


(3 - (3\\2 >t) < 


J 2 {X(-^^DYYs^^^/nk ^ > t/2| n To 
+ P ({II {XD2fe/n\\2 > t/2} n T) + P(To") + P(T") 


< 2 exp 


^(ilog(6) - ^ +feexp(-cn//i;) + (/i; + l)exp( 


-ClT^). 


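When $d \to \infty$ and $\log n = o(d)$, choose $\tau = \sqrt{d/c_1}$ and $\delta_1 = O(\sqrt{kd/n})$. Then there exists a
constant $C$ such that

$$P\Big(\|\bar\beta - \hat\beta\|_2 > C\frac{\sqrt{k}\,d}{n}\Big) \le (k + 3)\exp(-d) + k\exp\Big(-\frac{cn}{k}\Big).$$

Otherwise choose $\tau = \sqrt{\log n/c_1}$ and $\delta_1 = O(\sqrt{k\log n/n})$. Then there exists a constant $C$ such
that

$$P\Big(\|\bar\beta - \hat\beta\|_2 > C\frac{\sqrt{k}\log n}{n}\Big) \le \frac{k + 2}{n} + k\exp\Big(-\frac{cn}{k}\Big).$$

Overall, we have

$$P\Big(\|\bar\beta - \hat\beta\|_2 > C\frac{\sqrt{k}\,(d \vee \log n)}{n}\Big) \le ck\exp(-(d \vee \log n)) + k\exp(-cn/k),$$

which leads to the final conclusion.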


□ 
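As an informal numerical companion to Theorem 4.10, the sketch below compares the full-sample OLS estimator with the average of $k$ subsample OLS estimators. It is only an illustration under stated assumptions: dimension $d = 2$, independent standard Gaussian covariates and noise, and the hypothetical helper names `ols_2d` and `divide_and_conquer_ols`. None of these choices come from the paper.

```python
import random


def ols_2d(rows):
    """Least squares for y = b0*x0 + b1*x1 via the 2x2 normal equations."""
    s00 = sum(x0 * x0 for x0, _, _ in rows)
    s01 = sum(x0 * x1 for x0, x1, _ in rows)
    s11 = sum(x1 * x1 for _, x1, _ in rows)
    t0 = sum(x0 * y for x0, _, y in rows)
    t1 = sum(x1 * y for _, x1, y in rows)
    det = s00 * s11 - s01 * s01
    return ((s11 * t0 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det)


def divide_and_conquer_ols(rows, k):
    """Average the OLS fits computed on k equal-sized consecutive subsamples."""
    n_k = len(rows) // k
    fits = [ols_2d(rows[j * n_k:(j + 1) * n_k]) for j in range(k)]
    return tuple(sum(f[c] for f in fits) / k for c in range(2))


rng = random.Random(0)           # fixed seed so the run is reproducible
beta_star = (1.5, -0.7)          # hypothetical true coefficients
n, k = 6000, 10
rows = []
for _ in range(n):
    x0, x1 = rng.gauss(0, 1), rng.gauss(0, 1)
    y = beta_star[0] * x0 + beta_star[1] * x1 + rng.gauss(0, 1)
    rows.append((x0, x1, y))

beta_full = ols_2d(rows)
beta_bar = divide_and_conquer_ols(rows, k)
gap = max(abs(a - b) for a, b in zip(beta_bar, beta_full))
err = max(abs(a - b) for a, b in zip(beta_full, beta_star))
print("full-sample error:", err)
print("gap between averaged and full-sample fits:", gap)
```

In runs of this kind the gap between the two estimators is typically much smaller than the estimation error of either, which is the qualitative content of the bound $\sqrt{k}(d \vee \log n)/n$ when $k$ is moderate.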


Proof of Corollary 4.12. Define an event $\mathcal{E} = \{\|\bar\beta^d - \beta^*\|_\infty \le 2C\sqrt{\log d/n}\}$, then by the condition
on the minimal signal strength and Lemma 4.1, for some constant $C$ we have


||/3''-/3°||2>C 


,;y/k{s V logn) 


< 


n 


||/3^'-/3°||2>C 


,;y/k{s V logn) 


n 


<p N 11/3 -/3°||2 >C 


,, y/k{s V logn) 


n 


n T + P(T=) 


n T + cjd 


< ckex.p{—{s V logn)) + kex.p{—cn/k) + cfd. 


where /3° 


^ {Xg'^'^Xg'^) which is the average of the oracle estimators on the 

i=i 


subsamples. Then the conclusion can be easily validated. 


□ 
Proof of Theorem 4.13. The following notation is used throughout the proof.

S{f3) := VHn{f3) = -X^D{Xp)X, := D{X^^'>p)X^^\ 

n ^ Uk 

Sx := -X^X, := —X^^^^X^^) 

n nk 

For any j = 1,..., /c, satisfies 

= 0 . 

nk 

Through a Taylor expansion of the left hand side at the point /3 = /3*, we have 

—- /x(x(^’)/3*)) - S^3)m{i) _ p*\ _ = q, 

nk 

where the remainder term is a d dimensional vector with component 

-r)'^V^[(xW)'^/.t(x(^)/3)](^(^) -/3*) 

= —0^^^ - fB*fX^^^^dmg{X0 o - jS*), 

6nfc " 

where is in a line segment between and P*. It therefore follows that 
pU) =p* + (5(i))-i[x(^')^(rW - n{X^3)fj*)) + rikr^J)^. 

A similar equation holds for the global MLE /3: 

P = P* + - fiiXp*)) + nr], 

where for s' = 1 ,..., d, 

rg = ^iP-P*fX^diag{XgOfj,"{{XP^0)}X{P-P*). 

Therefore we have 

k k 

i ^30) -p=^ - fi{X^^0*)) 

j=i J=i 

- {5-^ - E-^}X^(Y - fi(Xp*)) + R = B + R, 

k 

where R = (l/k) ^ — S~^r. We next derive stochastic bounds for ||S||2 and ||i?||2 

i=i 

respectively, but to study the appropriate threshold, we introduce the following events with prob¬ 
ability that approaches one under appropriate scaling. For j = 1,... ,k and n,T,t > 0, 

SU) := {||(5(^'))-1||2 < 2/Cn,in}n - S ||2 < (di V d?)} n < 20^..}, 

£ := {115-^2 < 2/L^in} n {||5 - SII 2 < (d 2 V di)} n {||5 x||2 < 2 C^,^}, 

- |||3(i) _ , X-.= [\\P- P*\\2 > t} , 
where $\delta_1 = C_1\sqrt{d/n_k} + \tau/\sqrt{n_k}$ and $\delta_2 = C_1\sqrt{d/n} + \tau/\sqrt{n}$. Denote the intersection of all the above

events by $\mathcal{A}$. Note that Condition 3.6 implies that $\sqrt{b''(X_i^T\beta)}\,X_i$ are i.i.d. sub-gaussian vectors,
so by Lemmas B.1, B.4, B.3 and B.10, we have

< (2fc + 1) exp (-^) +{k + l) exp(-cir2) + 2fcexp log 6 - . 

We first consider the bounded design, i.e., Condition 3.6 (ii). In order to bound $\|R\|_2$, we first
derive an upper bound for $r_g^{(j)}$. Under the event $\mathcal{A}$, by Lemma A.5 we have

max and max Vg < ^MUsCmax^^- 

i<g<<i,i<j<k ^ 3 i<g<d 3 

It follows that, under A, 

II.RII 2 < ■^MVdUsCraa.xt'^- (7.20) 

Note that $B$ is very similar to the RHS of Equation (7.19). Now we use essentially the same
proof strategy as in the OLS part to bound $\|B\|_2$. Following similar notation as in OLS, we denote
$(S^{(j)})^{-1} - \Sigma^{-1}$ by $D_1^{(j)}$, $\Sigma^{-1} - S^{-1}$ by $D_2$, $Y^{(j)} - \mu(X^{(j)}\beta^*)$ by $\varepsilon^{(j)}$ and $Y - \mu(X\beta^*)$ by $\varepsilon$. For

concision, we relegate the details of the proof to Lemma B.9, which delivers the following stochastic
bound on $\|B\|_2$.




/ /^4 t2 j.2 \ 

F({||B||. > M n4) < 2exp (rflog(6) - ^ 

Combining Equation (7.21) with (7.20) leads us to the following inequality. 

P ^11/3 - P \\2 > < (2A; + 1) exp (-y) + {k + 1) exp(-cir^) 

/ ^2 t2 j.2 \ / ^4 t2 j2 \ 

+ (. + 1) exp (.logo - + 2exp (..ogO - Ly ) ■ 

Choose t = ti = sjd/nk and, when d ;:g> log re, choose r = d/ci and hi = O^^kdln). Then there 
exists a constant C > 0 such that 

( — -- kd^A\ rn 

P 11/3 - /3||2 > < (2fc + 1) exp(-^) + 2{k + 2) exp(-d). 

choose r = ^logre/ci and h = 0{^/lAog^^Jn). Then there 


When it is not true that d S> log re, 
exists a constant C > 0 such that 


P-5lle>cLiP!?5!?)<(2t + l)exp(-f) + 

n I k 


/c + 3 
re 


Overall, we have 

P ^11/3 — /3||2 > ^ log re) ^ ^ cA:exp(—cre//c) + c/cexp(—cmax(d, logre)), 

which leads to the final conclusion. 


□ 
Proof of Corollary 4.15. Define an event $\mathcal{E} = \{\|\bar\beta^d - \beta^*\|_\infty \le 2C\sqrt{\log d/n}\}$, then by the conditions
of Corollary 4.15 and results of Lemma 4.6 and Theorem 4.13,


0 -n i < 


n 


\\P^ -P'^h>C 


, fky / s{s V logn) 


n 


<PN \\(3-n>C‘ 


,ky/s{s V logn) 


n 


n£j+ 

n <5 ) + cjd 


< ck exp(—(s V log n)) + k exp(—cn/fe) + c/d. 


where /3° = j E = argmax^gRd^^^^^o and C is a constant. Then it is not 

i=i 

hard to see that the final conclusion is true. □ 


Acknowledgements: The authors thank Weichen Wang, Jason Lee and Yuekai Sun for helpful 
Supplementary material to

Distributed Estimation and Inference with Statistical Guarantees 

Heather Battey*^ Jianqing Fan* Han Liu* Junwei Lu* Ziwei Zhu* 

Abstract 

This document contains the supplementary material to the paper “Distributed Estimation
and Inference with Statistical Guarantees”. In Appendix A, we provide the proofs
of technical results required for the analysis of divide and conquer inference. Appendix B
collects the proofs of lemmas for the estimation part.


A Auxiliary Lemmas for Inference 


In this section, we provide the proofs of the technical lemmas for the divide and conquer inference. 

Lemma A.1. Under Condition 3.2, $m_v^{(j)T}\hat\Sigma^{(j)}m_v^{(j)} \ge c_{n_k}$ for any $j \in \{1,\dots,k\}$ and for any
$v \in \{1,\dots,d\}$, where $c_{n_k}$ satisfies $\liminf_{n_k\to\infty} c_{n_k} = c_\infty > 0$.

Proof. The proof appears in the proof of Lemma B1 of Zhao et al. (2014b). □

Lemma A.2. Under the GLM (2.7), we have

$$\mathbb{E}\exp(t(Y - \mu(\theta))) = \exp\big(\phi^{-1}(b(\theta + t\phi) - b(\theta) - \phi t\,b'(\theta))\big),$$

and typically when there exists $U > 0$ such that $b''(\theta) \le U$ for all $\theta \in \mathbb{R}$, we will have

$$\mathbb{E}\exp(t(Y - \mu(\theta))) \le \exp\Big(\frac{\phi Ut^2}{2}\Big),$$

which implies that $Y$ is a sub-Gaussian random variable with variance proxy $\phi U$.

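Proof. Writing $\mu(\theta) = b'(\theta)$,

$$\begin{aligned}
\mathbb{E}\exp(t(Y - \mu(\theta))) &= \int_{-\infty}^{+\infty} c(y)\exp\Big(\frac{y\theta - b(\theta)}{\phi}\Big)\exp(t(y - \mu(\theta)))\,dy \\
&= \int_{-\infty}^{+\infty} c(y)\exp\Big(\frac{(\theta + t\phi)y - (b(\theta) + \phi t\,b'(\theta))}{\phi}\Big)\,dy \\
&= \int_{-\infty}^{+\infty} c(y)\exp\Big(\frac{(\theta + t\phi)y - b(\theta + t\phi) + b(\theta + t\phi) - (b(\theta) + \phi t\,b'(\theta))}{\phi}\Big)\,dy \\
&= \exp\big(\phi^{-1}(b(\theta + t\phi) - b(\theta) - \phi t\,b'(\theta))\big).
\end{aligned}$$

When $b''(\theta) \le U$, the mean value theorem gives, for some $\tilde\theta$ between $\theta$ and $\theta + t\phi$,

$$\mathbb{E}\exp(t(Y - \mu(\theta))) = \exp\Big(\frac{b''(\tilde\theta)\phi^2t^2}{2\phi}\Big) \le \exp\Big(\frac{\phi Ut^2}{2}\Big).$$

□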
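The identity and the bound in Lemma A.2 can be checked numerically in a simple special case. The sketch below takes the Bernoulli (logistic) model as an assumed example, for which $b(\theta) = \log(1 + e^{\theta})$, $\phi = 1$ and $b''(\theta) \le U = 1/4$. The function names are hypothetical, and the grid of $(\theta, t)$ values is arbitrary.

```python
import math


def b(theta):
    """Cumulant function of the Bernoulli (logistic) model, with phi = 1."""
    return math.log1p(math.exp(theta))


def b_prime(theta):
    """Mean function mu(theta) = b'(theta) = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))


def mgf_centred(theta, t):
    """E exp(t (Y - mu)) computed directly from the two-point distribution."""
    mu = b_prime(theta)
    return mu * math.exp(t * (1.0 - mu)) + (1.0 - mu) * math.exp(-t * mu)


def mgf_formula(theta, t, phi=1.0):
    """Closed form exp(phi^{-1}(b(theta + t*phi) - b(theta) - phi*t*b'(theta)))."""
    return math.exp((b(theta + t * phi) - b(theta) - phi * t * b_prime(theta)) / phi)


U = 0.25  # sup of b''(theta) = mu (1 - mu) in the Bernoulli model

for theta in (-2.0, 0.0, 1.5):
    for t in (-3.0, -0.5, 0.7, 2.0):
        direct = mgf_centred(theta, t)
        closed = mgf_formula(theta, t)
        bound = math.exp(U * t * t / 2.0)
        print(theta, t, direct, closed, bound)
```

In this special case the sub-Gaussian bound reduces to $\exp(t^2/8)$, which is the familiar Hoeffding bound for a variable taking values in $[0,1]$.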

Lemma A.3. Under Condition 3.6, we have for any $\beta, \beta' \in \mathbb{R}^d$ and any $i = 1,\dots,n$, $|\ell_i''(X_i^T\beta) - \ell_i''(X_i^T\beta')| \le K_i|X_i^T(\beta - \beta')|$, where $0 < K_i < \infty$.

Proof. By the canonical form of the generalized linear model (equation (2.8)),

$$|\ell_i''(X_i^T\beta) - \ell_i''(X_i^T\beta')| = |b''(X_i^T\beta) - b''(X_i^T\beta')| \le |b'''(\eta)|\,|X_i^T(\beta - \beta')|$$

by the mean value theorem, where $\eta$ lies in a line segment between $X_i^T\beta$ and $X_i^T\beta'$. $|b'''(\eta)| <
U_3 < \infty$ by Condition 3.6 for any $\eta$, hence the conclusion follows with $K_i = U_3$ for all $i$. □
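The Lipschitz property in Lemma A.3 can be illustrated in the logistic model, taken here as an assumed example. There $b''(x) = \sigma(x)(1 - \sigma(x))$ with $\sigma$ the logistic function, and $\sup_x|b'''(x)| = 1/(6\sqrt{3})$, so one may take $U_3 = 1/(6\sqrt{3})$. The sketch checks the resulting inequality on a grid; the helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def b_second(x):
    """b''(x) for the logistic model: sigma(x) * (1 - sigma(x))."""
    p = sigmoid(x)
    return p * (1.0 - p)


# Supremum of |b'''(x)| = |p (1 - p) (1 - 2p)| over p in (0, 1).
U3 = 1.0 / (6.0 * math.sqrt(3.0))

grid = [i / 4.0 for i in range(-24, 25)]  # linear predictors from -6 to 6
worst_ratio = 0.0
for x in grid:
    for y in grid:
        if x != y:
            ratio = abs(b_second(x) - b_second(y)) / abs(x - y)
            worst_ratio = max(worst_ratio, ratio)

print("largest observed Lipschitz ratio:", worst_ratio)
print("U3:", U3)
```

The largest observed ratio stays below $U_3$, in line with the mean value argument in the proof.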

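Lemma A.4. Under Conditions 2.6 and 2.1 (i), we have for any $\delta \in (0,1)$ such that $\delta^{-1} \ll d$,

$$P\Big(\frac{1}{n}\|X(\hat\beta^\lambda - \beta^*)\|_2^2 \gtrsim \frac{s\log(d/\delta)}{n}\Big) \le \delta.$$

Proof. Decompose the object of interest as

$$\begin{aligned}
\frac{1}{n}\|X(\hat\beta^\lambda - \beta^*)\|_2^2 &= (\hat\beta^\lambda - \beta^*)^T(\hat\Sigma - \Sigma)(\hat\beta^\lambda - \beta^*) + (\hat\beta^\lambda - \beta^*)^T\Sigma(\hat\beta^\lambda - \beta^*) \\
&\le \|\hat\Sigma - \Sigma\|_{\max}\|\hat\beta^\lambda - \beta^*\|_1^2 + \lambda_{\max}(\Sigma)\|\hat\beta^\lambda - \beta^*\|_2^2.
\end{aligned}$$

This gives rise to the tail probability bound

$$P\Big(\frac{1}{n}\|X(\hat\beta^\lambda - \beta^*)\|_2^2 > t\Big) \le P\Big(\|\hat\Sigma - \Sigma\|_{\max}\|\hat\beta^\lambda - \beta^*\|_1^2 > \frac{t}{2}\Big) + P\Big(\lambda_{\max}(\Sigma)\|\hat\beta^\lambda - \beta^*\|_2^2 > \frac{t}{2}\Big). \tag{A.1}$$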

Let M := {||S — E||oo < M}. Since {X is bounded, it is sub-Gaussian as well. Suppose 
||-^i||i /)2 < then by Lemma B.2 we have, 


E(M'=)< j;P(|sW-Sp,|>M) 

p,q=l 

-m2 Mi 


< d? exp ( —Cn ■ min| 


J 


where (7 is a constant. Hence taking M = n ^ log{d/5), 


^(Al'^) < ^2exp Cnminl 


(log(d/(5))2 (log((i/<5))^ 


K'^rP 




}} 


and the right hand side is less than 5 for 5 ^ <C d. Thus by Condition 2.1, the first term on the 
right hand side of equation (A.l) is 


S-S /3 -/9* :> 

11 max 11^ ^ 111 ~ 


2 ^ s\og{d/5)\ 


< 26. 


n 


2 








Furthermore, by Condition 3.6 (i), the second term on the right hand side of equation (A.l) is 

p(An,ax(5])||3^ -rll' > < 6 . 

Taking t as the dominant term, t x Caia.^n~^s\og{d/5), yields the result. □ 

Lemma A.5. Under Condition 3.6, we have for any $i = 1,\dots,n$,

$$|b''(X_i^T\beta_1) - b''(X_i^T\beta_2)| \le MU_3\|\beta_1 - \beta_2\|_1,$$

and if we consider the sub-Gaussian design instead, we have

P {\h"{Xf(3i) - h''{Xffi 2 )\ > hC/3||/3i - (32 \\i) < ndexp (^1 - ^ j . 

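Proof. For the bounded design, by Condition 3.6 (iii), we have

$$|b''(X_i^T\beta_1) - b''(X_i^T\beta_2)| \le U_3|X_i^T(\beta_1 - \beta_2)| \le U_3\|X_i\|_\infty\|\beta_1 - \beta_2\|_1 \le MU_3\|\beta_1 - \beta_2\|_1.$$

For the sub-Gaussian design, denote the event $\{\max_{1\le i\le n,\,1\le j\le d}|X_{ij}| \le \kappa\}$ by $\mathcal{C}$, where $\kappa$ is a

positive constant. Then it follows that,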

F {C^) < ndexp ^1-, 

where $C$ is a constant. Since on the event $\mathcal{C}$, $|b''(X_i^T\beta_1) - b''(X_i^T\beta_2)| \le \kappa U_3\|\beta_1 - \beta_2\|_1$, we reach
the conclusion. □


Remark A.6. For the sub-Gaussian design, in order to let the tail probability go to zero, h ^ 
log((n V d)). 


Lemma A.7. Suppose, for any k d satisfying k = o(((s V si)logd) the following condi¬ 


tions are satisfied. (Al) P > Hj < where H is a constant and ^ = o{k ^). 

(A2) For any /3, (3' G and for any i G {1,..., n}, \i'f{Xj'(3) — £'l{Xj'(3')\ < Ki\x'[{(3 — (3') \ with 
F{Ki >h)<i)ioYif = o{k-^) and h = 0(1). (A3) -I3*)f^ > n-^sk\og{d/5)^ < 5. 

(A4) P(maxi<,<d| - e,) {p^{Vj) - /3*) | > n-^sk\og{d/5)) < 5. Then 


= + op(n-i/2)_ 

i=i 


for any 1 < u < d. 

Proof of Lemma A. 7. — jd* = k~^ By the definition of f3‘^{T>j), 

Pt{v,) - /?: = - PI - 


3 







Consider a mean value expansion of V 6^1 {(3^ around (3*: 

= Vl^l{(3*) + VH^l{(3o^)0\Vj)-l3*), 
where Pa = <a/3^(^j) + (1 “ Oi)P*, a G [0,1]. So 

-p: = -i ^©o)r v 4^2(/3*) - r - ^v){P\vp - r) 


i=i 


i=i 


i=i 


and |A| < ^ X]j=i(l^i'^^l + 1^2^^I) where 


|aW| = 


V2 4^2 {P\V,)) - eP) {p\vp - p*) 


By (A4) of the lemma, for t >i n ^sklog{d/5), 

k 

.(i)i 


^ A^l > ktj < p(u4i|a4| >tj < ^P(|a4| >t) < k 6 . 

j=i i=i 

Substituting $\delta = o(k^{-1})$ in the expression for $t$ and noting that $k \ll d$, we obtain $k^{-1}\sum_{j=1}^{k}|\Delta_1^{(j)}| =
o_P(n^{-1/2})$ for $k = o((s\log d)^{-1}\sqrt{n})$. By (A2),


.0')| _ 




^ ^ (/9"(^.) - P*) {^”{XjPa) - ei{Xjp\vp)) 

i^Xj 


therefore by (Al) and (A3) of the lemma, for t^n ^sk\og{d/5)^ 


y^A^i >kt) <p(^u4iiAyi >4 < j^PdA^-^'^i >t)<k{p+5+e). 

i=i i=i 

Substituting $\delta = o(k^{-1})$ in the expression for $t$ and noting that $k \ll d$, we obtain $k^{-1}\sum_{j=1}^{k}|\Delta_2^{(j)}| =
o_P(n^{-1/2})$ for $sk\log(d/\delta) = o(\sqrt{n})$, i.e. for $k = o((s\log d)^{-1}\sqrt{n})$. Combining these two results
delivers $\Delta = o_P(n^{-1/2})$ for $k = o((s\log d)^{-1}\sqrt{n})$. □

Lemma A.8. Suppose, in addition to Conditions (Al)-(A5) of Lemma A.7, (A5) — 0*^| = 

op(l)forallu E {!,..., d}; (A6) 1/0;^ = 0(1) for allu E {l,...,d}; (A7) || Ei<j<fc EieXjA7Aj(/3*)||oo 
Of>{y/n logd); (A8) For each v E {l,...,d}, letting V£4(/3*)/\A^, = 0, 

Var(ELi Eiei, ^2^) = 1 and, for all e > 0, 




k^OO TLj^^OO 


(A.2) 


j=i ieVj 


4 















Then under $H_0 : \beta^*_v = \beta^H_v$, taking $k = o(((s \vee s_1)\log d)^{-1}\sqrt{n})$ delivers $S_n \rightsquigarrow N(0,1)$, where $S_n$ is
defined in equation (3.13).

Proof. Rewrite equation (3.13) as 


f (0*Ji/2 

Sn = yn- 2 ^ + 


k 

j=i ieXj 

(,■) _ ^(,) ^ ( (0;ji/2 


-1 


(A.3) 




(n0*jV2 ’ 2,* (n0*^)V2 

Further decomposing the first term, we have 


01/2 

^vv 


- 1 


EE^S = EE + A, where A=EE{®i"-®a 


,U) 


j=l i&Xj 


(n0*Ji/2 


j=l i&Xj j=l i&Xj 

and X]j=i X^iex ^ -^(0) 1) by the Lindeberg-Feller central limit theorem. Then by Holder’s 
inequality, Condition 3.7 and Assumption (A6) and (A7), 

lAi ^ iiAili ^*11 II Si=i Sielj 

A < max 0)/'' — 0„ b — — - —-— 

i<j<k" "1 


(n0^Ji/2 

= Op(siY'^^^^^^^Op(\/k)g^) = op(l), 


where the last equation holds with the choice of fe = o((si logd) ^\/n). Letting A^'^^ = (0)i^)^/^ — 


0^{^ we have 


EE ^2 

j=i ieXj 


E E + E E{ei” - es’-TAgaso) 


(e;.)‘A 

. (1) I A L) 


j=l i£X^ 

E E +^ 2 *) ’ 

j=l 


j=i ieXj 




|EEAai|<|EEf“|l 0 T-(®:.)‘"’l' 

j=l i&Xj j=l i&Xj 


Since Ql, > 0, = |0,,|i/2 = |0,, - 0;, + 0:ji/2 < |0,, - 0;ji/2 + (0;ji/2. Similarly 

(0;ji/2 = |0;ji/2 = |0;^ - + 0,,|i/2 < |0);^ - 0,,|i/2 + eyf, 
yielding < |0 '!;d — and consequently, by assumption (A5),




^1/2 


Invoking (A9) and the Lindeberg-Feller CLT, lEE ^ 21 , 1 1 = Op(l)op(l) = op(l). Similarly 

j=l i&Ij 


lEE^ 

j=i iax. 


U) 

22,j 


< maxJ|0W 


1-1/2 ^(j) 

j=l i&X. 

Combining all terms in the decomposition (A.3) delivers the result. 


= op(l). 


□ 


(B1)-(B5) of Condition A.9 are used in the proofs of subsequent lemmas. 

Condition A.9 . (Bl) ||tu*||i < si, || J*||max < oo and for any 6 G (0,1), 

> n-^/^sy^Iogid/^^ < S and P(^||t(} - w*\\i > n~^/‘^siy/log{d/6)^ < 5. 

(B2) For any <5 G (0,1), 


p(||V_X(/3:,/31J||oo > n-VVlog(d/<I)) <A 
(B3) Suppose satisfies (Bl). Then for /3-v,a = ctfl-v + (1 “ ct)/3-^ and for any <5 G (0,1), 


sup 

.ae[o,i] 


{Vl_Jn{P:,f3-,,a) - W^V\_Jn{/3;,P-v,a))0\ - P*-,) 


> 


SlS- 


lo g(d/<5) \ 

n I 


< 6 . 


(B4) There exists a constant C > 0 such that C < 
that 

^V*^Vinil3;,PU) 

-^ ^ 

V V*'^J*V* 


< oo, and for v* 
A(0,1). 


{1, , it holds 


(B5) For any 6, if there exists an estimator /3 = (/3j, P'L^Y' satisfying ||/3—/3* ||i < Cs^/n ^log((i/(5) 
with probability >1 — 5, then 


p(||v 2 4(/3) - JiL,, > < 5. 


The proof of Theorem 3.11 is an application of Lemma A.13. To apply this Lemma, we must
first verify (B1) to (B4) of Condition A.9. We do this in Lemma A.10.

Lemma A.10. Under the requirements of Theorem 3.11, (B1) - (B4) of Condition A.9 are fulfilled.


Proof. Verification of (B1). As stated in Theorem 3.11, $\|w^*\|_1 = O(s_1)$ and $\|J^*\|_{\max} < \infty$ by
part (i) of Condition 3.6. The rest of (B1) follows from the proof of Lemma C.3 of Ning and Liu
(2014).
Verification of (B2). Let X, = {Qi,Zf)^. Since || 4(/3*) ||oo = ||-^ W/3*))^*|lc

since the product of a subgaussian random variable and a bounded random variable is subgaussian, 
and since E[V-y4(/9*)] = 0, we have by Condition 3.6, Bernstein’s inequality and the union bound 

Setting 2{d — 1) exp{—nt^/M^cr^} = <5 and solving for t delivers the result. 


Verification of (B3) Let /3* = {9*,ja) and decompose the object of interest as 




t=l 


where the terms Ai - A 5 are given by Ai = _^4(/3a) — _j,4(/3*), 

A2 = A3 = 

A4 = - V\-Jnm), As = {W*^ - W^)V\_JM). 

We have the following bounds 

I 1 ” 


2=1 


< max Ki max ||Xj||oo ||- Ply)\\l, 
l<i<n l<i<n . n 

IA2I < \\Vl_Jnm-Jv,-v\L\\P-v-P-v\\l, IA3I < lkl|l||j-.,-.-Vl,,_A(/ 3 *)|Laxll^--/ 3 ^lll’ 
IA 4 I = 

< max Ki\\w* Will-Z{(3^y - P*_y)\\ 

l<i<n 


|2 

I 2 ’ 


and jAsI < ||tu* — •u}||i||V_^,_t,4(/9*)||j^g^^||/3l^ — /3L^||i. Let e = 5/5. Then by Condition 3.6 and 
Lemma A.4 




< e and P( IA 4 I > ssi 


log 4 /e) \ 
n J 


< £. 


Noting the /3* itself satisfies the requirements on (3 in (B5), Lemma A.11 and Condition 2.1 together 
give 


A 2 I > Si 


log 4 /e)N 


n J 

By (Bl) verified above and noting that 


< e and P( IA 3 I > sis 


lo g(d/e) N 
n / 


< 


|v_,,_,4(/3:)|L,, < \\v-y,-MP:) - v_,,_x(r)lL,, + 

the proof of Lemma A.11 delivers P^jAsI > sislog(d/e)/n^ < e. Combining the bounds, we finally 


have 


P( sup " d-y) 

^ae[ 0 ,l] 

Verification of (B4). See Ning and Liu (2014), proof of Lemma C.2. 


^ sis 


log(d/ 4 \ ^ ^ 


n / 


□ 
In the following lemma, we verify (B5) under the same conditions.

Lemma A.11. Under Conditions 3.6 and 2.1, (B5) of Condition A.9 is fulfilled. 

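Proof. We obtain a tail probability bound for $\Delta_1$ and $\Delta_2$ in the decomposition

$$\|\nabla^2\ell_n(\hat\beta) - J^*\|_{\max} \le \|\nabla^2\ell_n(\hat\beta) - \nabla^2\ell_n(\beta^*)\|_{\max} + \|\nabla^2\ell_n(\beta^*) - J^*\|_{\max} = \Delta_1 + \Delta_2.$$

For the control over $\Delta_2$, note that by Condition 3.6 (ii) and (iii),

$$|[\nabla^2\ell_i(\beta^*)]_{jk}| \le |b''(X_i^T\beta^*)|\,|X_{ij}X_{ik}| \le U_2M^2.$$

Hence Hoeffding's inequality and the union bound deliver

$$P(\Delta_2 > t) = P\big(\|\nabla^2\ell_n(\beta^*) - J^*\|_{\max} > t\big) \le 2d^2\exp\Big\{-\frac{nt^2}{8U_2^2M^4}\Big\}. \tag{A.5}$$

For the control over $\Delta_1$, we have by Lemma A.5,

$$|[\nabla^2\ell_i(\hat\beta) - \nabla^2\ell_i(\beta^*)]_{jk}| = |(b''(X_i^T\hat\beta) - b''(X_i^T\beta^*))X_{ij}X_{ik}| \le M^3U_3\|\hat\beta - \beta^*\|_1 \le M^3U_3s\sqrt{n^{-1}\log(d/\delta)}$$

with probability $> 1 - \delta$. Hoeffding's inequality and the union bound again deliver

$$P(\Delta_1 > t) = P\big(\|\nabla^2\ell_n(\hat\beta) - \nabla^2\ell_n(\beta^*)\|_{\max} > t\big) \le 2d^2\exp\Big\{-\frac{n^2t^2}{8U_3^2M^6s^2\log(d/\delta)}\Big\}. \tag{A.6}$$

Combining the bounds from equations (A.5) and (A.6) we have

$$P\big(\|\nabla^2\ell_n(\hat\beta) - J^*\|_{\max} > t\big) \le 2d^2\Big(\exp\Big\{-\frac{nt^2}{8U_2^2M^4}\Big\} + \exp\Big\{-\frac{n^2t^2}{8U_3^2M^6s^2\log(d/\delta)}\Big\}\Big).$$

Setting each term equal to $\delta/2$, solving for $t$ and ignoring the relative magnitude of constants, we
have $t = U_3\max\{n^{-1}s\log(d/\delta),\, n^{-1/2}\sqrt{\log(d/\delta)}\} = U_3n^{-1/2}\sqrt{\log(d/\delta)}$, thus verifying (B5). □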

Lemma A.12. For each j G {1,..., A:}, let f3-v,aj = Oij(3^jj{T>j) + {l — aj)/3fy, for some aj G [0,1], 
where /3^.„(21j) is defined in equation (2.2). Define 


Under (Bl) - (B3) of Condition A.9, k = 0 ]p[n ^/^) and k = op[n ^/^) 

whenever A: <C d is chosen to satisfy k = o((si log d)~^y/n). 

Proof. By Holder’s inequality, 

|aS^')| = \{w* < ||a)(P,)-^*llil|V-.^(f2(/5-/3-Jlloo, 












hence, for any t, 


>t}c {\\w{V,) - to*||i||V_.£W(/3-/3-)lloo > t}. 

Taking t = vq where v = Cn~^^‘^Sl^yklog{d/6) and q = ^yk\og{d/6), we have 

= p(^{ii'u)(pj) -'«^1llllv-^,4iH/3;^,/3-^)lloo > vq} n < i|^ 

+ p({||^(P,) - w*Uv.JJlip;,(3*_,)\\oo > vq} n I > i|) < 25 
by (Bl) and (B2) of Condition A.9. Hence the union bound delivers 


K K 

^ A^l > kvq'j < P^Uj=;^{|Ak| > A = °(1) 


i=i 


i=i 


for 6 = o{k ^). Taking 6 = k ^ for a > 0 arbitrarily small in the definition of v and q, the 
requirement is ksi logd = o[^/n} and ksi logk = o{y/n) for a > 0 arbitrarily small. Since k ^ d, 
k~^ Yl^j=i with k = o((si logd)~^y/n). Next, consider 

|Aki < sup . 

« 6 [ 0 , 1 ] 

By (B3) of Condition A.9, P(|Ak| > t} < 6 for t X sisn ^felog(d/5), hence, proceeding in an 


analogous fashion to in the control over k ^ '^j=i obtain 


EA 

i=i 


> kt] < P 


(Ai 


A^l >tj < (^|Ak| >tj <k6 = o(l) 

i=i 


ior 6 = o{k ^). HenceA: ^ with A = o((sis logd) Since (si logd) ^y/n 

o((sislogd)“^n^/^), k~^ + A^) = op(n“^/^) requires k = o((si log d)“^\/n). □ 

Lemma A.13. Under (B1) - (B4) of Condition A.9, with $k \ll d$ chosen to satisfy the scaling
$k = o(((s \vee s_1)\log d)^{-1}\sqrt{n})$,


k k 

= i^s(^)(/3:,/31J + op(n-V2) and 
i=i i=i 

k 

hm sup|P((J:|_J-V2^i <t)- <f>(t)| ^ 0. 

n—>-oo ^ 

j=i 

Proof. Recall 
Through a mean value expansion of S^^\l3*,f3\{'Dj)) around we have for each j G {1,..., fc},


^2 > 


for some (3-y^a = a/3-i;(T’j) + (1 — a)l3*_^, where 
a(^') = {w* 


aO') - 
^2 — 


It follows that 

k 


= i^5(^)(/3:,/31J + i^(A?'4A(^)) = i J^5(^)(r,7*) + op(n-V2) 


j=i j=i j=i j=i 


by Lemma A.12 whenever k = o((silogd) Observe 

k 1 ^ 

n(A:-'= V^(l,-m*^)(-and 


(A.7) 


i=i 


i=i 

,.*r\ T* „.,*T\T 


So ^/n]^Yl’j=i P-v) by Condition (B4). Similar to Corollary 3.9, we apply 

the Berry-Essen inequality to show that sup^ \F{^/n^ '^’j=i ^^^\PyP*-v) < P ~ ‘^(^)l —^0. □ 

Lemma A.14. Under Condition (Bl), for any 5 G (0,1), 


w — w 


|i > Cn ^^‘^Sl^yk \og{d/6)j < k5 and P^||/3_.„ —/31^||i > Cn ^^‘^s^/klog 
Proof. Set t = Csl^/n~yJf\og(d/6)) and note 

k k 

^^{w{'Dj) — m*)||i > kt) < F[Uj^i\\w{'Dj) — m*||i > t) < ^P(||lu — m*||i > t) 


< kS. 


j=i 


i=i 


< k6. The 
□ 


by the union bound. Then by Condition (Bl), P^||m — w* 111 > Cn ^k^si^klo, 
proof of the second bound is analogous, setting t = Cs^Jf^~PJy[oyfdJFy). 

Lemma A.15. Suppose (B5) of Condition A.9 is satisfied. For any <5, if there exists an estimator 
P — (PIjP-v)'^ satisfying \\(3 — (3*\\i < C,Sy/ri~^Pog(d/F) with probability 1 — (5, then 

1 ^ 

>C'n-VVA:log(d/<5)) < M. 
Proof. The proof follows from (B5) in Condition A.9 via an analogous argument to that of Lemma
A.14, taking $t = C\sqrt{n^{-1}k\log(d/\delta)}$. □


Lemma A.16. Suppose (B1)-(B5) of Condition A.9 are fulfilled. Then for any $k \ll d$ satisfying
$k = o(((s \vee s_1)\log d)^{-1}\sqrt{n})$, $|\bar J_{v|-v} - J^*_{v|-v}| = o_P(1)$.

Proof. Recall that J*|_^ = - J* and 

k 

= I so 


T _ T* 

R ^i;|—1;| 


k k 

- j:,.| + 




i=i 


i=i 


Ai 


Let (3 = (/3^,/3_.y) and note that ||/3 —/3*||i satisfies the clause in (B5) of Condition A.9 by Lemma 
A.14 when k = o(((s V si) log . Hence Ai = op(l) by Lemma A.15. 


As < 


k 

(pi- r,^i) 

i=i 


A 21 


A 22 


w 




j=i 


A 23 


By the fact that $\|J^*\|_{\max} < \infty$ and $\|w^*\|_1 \le Cs_1$ by (B1) of Condition A.9, an application of
Lemmas A.14 and A.15 delivers


1 




i=i 


A22 < \\w — m*||i IIT 


1 b*—00 


= op(l), 


A 23 < 


1=1 


for $k = o\big((s_1\log d)^{-1}\sqrt{n}\big)$, a fortiori for $k = o\big(((s \vee s_1)\log d)^{-1}\sqrt{n}\big)$. Hence $|\bar{J}_{v|-v} - J^*_{v|-v}| = o_P(1)$. □


B Auxiliary Lemmas for Estimation 

In this section, we provide the proofs of the technical lemmas for the divide and conquer estimation. 
Lemma B.1. Suppose $X$ is an $n \times d$ matrix that has independent sub-gaussian rows $\{X_i\}_{i=1}^n$.
Denote $E(X_iX_i^T)$ by $\Sigma$, then we have

\[
P\Big(\Big\|\frac{1}{n}X^TX - \Sigma\Big\|_2 \ge (\delta \vee \delta^2)\Big) \le \exp(-c_1t^2),
\]

where $t \ge 0$, $\delta = C_1\sqrt{d/n} + t/\sqrt{n}$ and $C_1$ and $c_1$ are both constants depending only on $\|X_i\|_{\psi_2}$.
Proof. See Vershynin (2010). □ 
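
As a rough numerical illustration of Lemma B.1 (not part of the original argument), the following sketch simulates the simplest case one could assume: $d = 2$, standard Gaussian rows and $\Sigma = I$. The helper names, the seed and the sample sizes are illustrative choices. It prints $\|n^{-1}X^TX - \Sigma\|_2$ next to $\sqrt{d/n}$, which is the scale that $\delta$ tracks when $t$ is held fixed.

```python
import math
import random

def spectral_norm_sym2(a, b, c):
    """Spectral norm of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return max(abs(mid + rad), abs(mid - rad))

def covariance_error(n, rng):
    """Return ||X^T X / n - I||_2 for n i.i.d. standard Gaussian rows in R^2."""
    s11 = s12 = s22 = 0.0
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
    return spectral_norm_sym2(s11 / n - 1.0, s12 / n, s22 / n - 1.0)

# Assumed setting: Sigma = I, d = 2; the seed is fixed only for reproducibility.
rng = random.Random(0)
errors = {n: covariance_error(n, rng) for n in (200, 2000, 20000)}
for n, err in errors.items():
    print(n, round(err, 4), "sqrt(d/n) =", round(math.sqrt(2 / n), 4))
```

The observed error should shrink at roughly the $\sqrt{d/n}$ rate as $n$ grows. The simulation says nothing about the constants $C_1$ and $c_1$.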


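Lemma B.2. (Bernstein-type inequality) Let $X_1, \ldots, X_n$ be independent centered sub-exponential
random variables, and $M = \max_{1 \le i \le n}\|X_i\|_{\psi_1}$. Then for every $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$ and every $t \ge 0$,
we have

\[
P\Big(\sum_{i=1}^{n} a_iX_i \ge t\Big) \le \exp\Big[-C_2\min\Big(\frac{t^2}{M^2\|a\|_2^2}, \frac{t}{M\|a\|_\infty}\Big)\Big].
\]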




Proof. See Vershynin (2010). □ 

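Lemma B.3. Suppose $X$ is an $n \times d$ matrix that has independent sub-gaussian rows $\{X_i\}_{i=1}^n$. If
$\lambda_{\max}(\Sigma) \le C_{\max}$ and $d \ll n$, then for all $M > C_{\max}$, there exists a constant $c > 0$ such that when
$n$ and $d$ are sufficiently large,

\[
P\Big(\Big\|\frac{1}{n}X^TX\Big\|_2 > M\Big) \le \exp(-cn).
\]

Proof. Apply Lemma B.1 with $t = \sqrt{cn/c_1}$, where $(\sqrt{c/c_1} \vee c/c_1) < M - C_{\max}$, and it follows that

\[
P\Big(\Big\|\frac{1}{n}X^TX - \Sigma\Big\|_2 \ge (\delta \vee \delta^2)\Big) \le \exp(-cn).
\]

Since $d \ll n$, we obtain $(\delta \vee \delta^2) \to \sqrt{c/c_1} \vee c/c_1$, so that on the complementary event $\|n^{-1}X^TX\|_2 \le \lambda_{\max}(\Sigma) + (\delta \vee \delta^2) < M$ for sufficiently large $n$, which completes the proof. □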


Lemma B.4. Suppose $X$ is an $n \times d$ matrix that has independent sub-gaussian rows $\{X_i\}_{i=1}^n$,
$EX_i = 0$, $\lambda_{\min}(\Sigma) \ge C_{\min} > 0$ and $d \ll n$. For all $m < C_{\min}$, there exists a constant $c > 0$ such
that when $n$ and $d$ are sufficiently large,

\[
P\Big(\Big\|\Big(\frac{1}{n}X^TX\Big)^{-1}\Big\|_2 \ge \frac{1}{m}\Big) = P\Big(\lambda_{\min}\Big(\frac{1}{n}X^TX\Big) \le m\Big) \le \exp(-cn).
\]

Proof. It is easy to check the following inequality. For any two symmetric and semidefinite $d \times d$
matrices $A$ and $B$, we have

\[
\lambda_{\min}(A) \ge \lambda_{\min}(B) - \|A - B\|_2,
\]

because for any vector $x$ satisfying $\|x\|_2 = 1$, we have $\|Ax\|_2 = \|Bx + (A - B)x\|_2 \ge \|Bx\|_2 - \|(A - B)x\|_2 \ge \lambda_{\min}(B) - \|A - B\|_2$. Then it follows that

\[
\begin{aligned}
P\Big(\Big\|\Big(\frac{1}{n}X^TX\Big)^{-1}\Big\|_2 \ge \frac{1}{m}\Big)
&= P\Big(\lambda_{\min}\Big(\frac{1}{n}X^TX\Big) \le m\Big)
\le P\Big(\lambda_{\min}(\Sigma) - \Big\|\frac{1}{n}X^TX - \Sigma\Big\|_2 \le m\Big) \\
&\le P\Big(\Big\|\frac{1}{n}X^TX - \Sigma\Big\|_2 \ge C_{\min} - m\Big) \le \exp(-cn),
\end{aligned}
\]

where $c$ satisfies $(\sqrt{c/c_1} \vee c/c_1) < C_{\min} - m$ and the last inequality is an application of Lemma B.1
with $t = \sqrt{cn/c_1}$. □
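
The eigenvalue perturbation inequality used in the proof of Lemma B.4, $\lambda_{\min}(A) \ge \lambda_{\min}(B) - \|A - B\|_2$, can be sanity-checked numerically. The sketch below is an illustration only: it restricts to $2 \times 2$ matrices so that eigenvalues have a closed form, and the helper names and the random matrix construction are assumptions made for the example.

```python
import math
import random

def eig_sym2(a, b, c):
    """Eigenvalues (min, max) of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return mid - rad, mid + rad

def random_psd2(rng):
    """Random positive semidefinite 2x2 matrix G G^T, returned as (a, b, c)."""
    g11, g12, g21, g22 = (rng.uniform(-1.0, 1.0) for _ in range(4))
    return (g11 * g11 + g12 * g12, g11 * g21 + g12 * g22, g21 * g21 + g22 * g22)

def gap(A, B):
    """lambda_min(A) - (lambda_min(B) - ||A - B||_2); nonnegative if the bound holds."""
    lo_a, _ = eig_sym2(*A)
    lo_b, _ = eig_sym2(*B)
    d_lo, d_hi = eig_sym2(A[0] - B[0], A[1] - B[1], A[2] - B[2])
    return lo_a - (lo_b - max(abs(d_lo), abs(d_hi)))

rng = random.Random(1)
worst_gap = min(gap(random_psd2(rng), random_psd2(rng)) for _ in range(10000))
print("smallest observed slack:", worst_gap)
```

A nonnegative slack (up to floating-point error) on every draw is what the inequality predicts. In the lemma it is applied with $A = n^{-1}X^TX$ and $B = \Sigma$.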


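Lemma B.5. (Hoeffding-type inequality). Let $X_1, \ldots, X_n$ be independent centered sub-gaussian
random variables, and let $K = \max_i\|X_i\|_{\psi_2}$. Then for every $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$ and every $t \ge 0$,
we have

\[
P\Big(\Big|\sum_{i=1}^{n} a_iX_i\Big| \ge t\Big) \le e \cdot \exp\Big(-\frac{ct^2}{K^2\|a\|_2^2}\Big).
\]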
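
To make the shape of this tail bound concrete, the sketch below treats one special case exactly: Rademacher signs, which are sub-gaussian. Because the constant $c$ in Lemma B.5 is unspecified, the comparison uses the classical explicit-constant Hoeffding bound $2\exp(-t^2/(2\|a\|_2^2))$ instead, which is a swapped-in stand-in rather than the lemma's own constants. The weight vector and thresholds are arbitrary illustrative choices.

```python
import itertools
import math

def exact_tail(a, t):
    """Exact P(|sum_i a_i * eps_i| >= t) for independent Rademacher signs eps_i."""
    hits = 0
    for signs in itertools.product((-1, 1), repeat=len(a)):
        if abs(sum(s * w for s, w in zip(signs, a))) >= t:
            hits += 1
    return hits / 2 ** len(a)

def hoeffding_bound(a, t):
    """Classical two-sided Hoeffding bound 2 * exp(-t^2 / (2 * ||a||_2^2))."""
    norm_sq = sum(w * w for w in a)
    return 2.0 * math.exp(-t * t / (2.0 * norm_sq))

# Hypothetical weights; all 2^12 sign patterns are enumerated, so no sampling error.
weights = [1.0, 0.5, 2.0, 1.5, 0.25, 1.0, 0.75, 1.25, 0.5, 2.0, 1.0, 0.5]
table = [(t, exact_tail(weights, t), hoeffding_bound(weights, t)) for t in (1.0, 3.0, 5.0, 7.0)]
for t, p, bound in table:
    print(f"t={t}: exact={p:.4f}  bound={bound:.4f}")
```

In both the exact tail and the bound, the decay is governed by $t^2/\|a\|_2^2$, which is the quantity appearing in the exponent of the lemma.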




Lemma B.6. (Sub-exponential is sub-gaussian squared). A random variable $X$ is sub-gaussian
if and only if $X^2$ is sub-exponential. Moreover,

\[
\|X\|_{\psi_2}^2 \le \|X^2\|_{\psi_1} \le 2\|X\|_{\psi_2}^2.
\]


Lemma B.7. Let $X_1, \ldots, X_n$ be independent centered sub-gaussian random variables. Let $\kappa = \max_i\|X_i\|_{\psi_2}$ and $\sigma^2 = \max_i EX_i^2$. Suppose > 1, then we have




n 


i=l 


a'^n 


< exp -C 2 ^y- 


Proof. Combining Lemma B.2 and Lemma B.6 yields the result. 


□ 


Lemma B.8. Following the same notation as in the beginning of Proof of Theorem 4.10, 


r ({ I|1 > */2| nfo) < exp (dlog(6) - 

and 

P {l\\(XD,fe/nh > t/2} nf) < exp (<!log(6) - ■ 

Proof. 


rik 


E ( exp / rik)^ \ X’-T^ “IT®' \ X^^^J 

i=l 

N 

E (exp (A(D 2 v)'^(X^e/n)) | X) = ]^ T (exp {{XXi/N)'^{D 2 'v)ei) \ X) 

i=l 

< exp f^CsA^s? ^ , 


(B.l) 


(B.2) 


i=l 
where we write $A_i^{(j)}$ and $A_i$ in place of $(X_i^{(j)})^TD_1^{(j)}v$ and $X_i^TD_2v$ respectively, $C_3$ is an absolute constant, and the last inequality holds because the $\epsilon_i$ are sub-gaussian. Next we provide an upper bound on $\sum_{i=1}^{n_k}(A_i^{(j)})^2$ and $\sum_{i=1}^{n}A_i^2$. Note that

n 

= ^^D?X^Xd[^K = v^((5$))-i - - (S)-i)v 

i=l 

and similarly,

\[
\sum_{i=1}^{n} A_i^2 = n\,v^T\Sigma^{-1}(\Sigma - S_X)(S_X)^{-1}(\Sigma - S_X)\Sigma^{-1}v.
\]

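For any $\tau \in \mathbb{R}$, define the event $\mathcal{E}^{(j)} = \{\|(S_X^{(j)})^{-1}\|_2 \le 2/C_{\min}\} \cap \{\|S_X^{(j)} - \Sigma\|_2 \le (\delta_1 \vee \delta_1^2)\}$ for all
$j = 1, \ldots, k$, where $\delta_1 = C_1\sqrt{d/n_k} + \tau/\sqrt{n_k}$, and the event $\mathcal{E} = \{\|(S_X)^{-1}\|_2 \le 2/C_{\min}\} \cap \{\|S_X - \Sigma\|_2 \le (\delta_2 \vee \delta_2^2)\}$, where $\delta_2 = C_1\sqrt{d/n} + \tau/\sqrt{n}$. On $\mathcal{E}^{(j)}$ and $\mathcal{E}$, we have respectively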

r) n 

V 6 jf and ^Aj < -^{62 V 5lf. 

2=1 ^min 2=1 ^min 

Therefore from Equations (B.1) and (B.2) we obtain

E (^exp(A(T>[^V)'^(x(^)'^e(-^V«fc)) < exp ((^1 V 

and 

E (exp(A(T> 2 v)^(X'^e/n)) IjT}) < exp V . 

In addition, according to Lemmas B.1 and B.4, the probabilities of both $(\mathcal{E}^{(j)})^c$ and $\mathcal{E}^c$ are very
small. More specifically,

\[
P(\mathcal{E}^c) \le \exp(-cn) + \exp(-c_1\tau^2) \quad\text{and}\quad P\big((\mathcal{E}^{(j)})^c\big) \le \exp(-cn/k) + \exp(-c_1\tau^2).
\]

Let $\mathcal{E}_0 := \bigcap_{j=1}^{k}\mathcal{E}^{(j)}$. An application of the Chernoff bound trick leads us to the following inequality.



E(«l’ v)^(A(^')'^e(^'))/nfc > t/2 


i=i 


> nTn 


< exp(—At/ 2 ) nK exp 

< exp r-At/2 + ^^ 3 ^ V . 

V ^min” / 



Minimizing the right-hand side over $\lambda$, we have


r ({ r > */2| nfo) < exp (-, 

Consider the 1/2-net of $\mathbb{R}^d$, denoted by $\mathcal{N}(1/2)$. Again it is known that $|\mathcal{N}(1/2)| \le 6^d$. Using the
maximal inequality, we have


= sup P I 

l|v|| 2 =i y 

< sup P 
veAf(l/2) 

< exp (<!l„g(6) - ' 

Proceeding in an analogous fashion, we obtain 

r ({||(XD.f e/n|b > tftjne) < exp (<ilog(6) - 33^/2/^//,2)2 



□ 
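
The "minimize over $\lambda$" step in the proof above follows a generic pattern that is worth writing out once. The derivation below is a sketch under an assumed moment generating function bound: $Z$ stands for the scalar being controlled, $\mathcal{E}$ for the conditioning event, and $a > 0$ is a placeholder constant rather than the specific constant of the lemma.

```latex
% Assumption (placeholder constant a > 0):
%   E[ exp(\lambda Z) 1{\mathcal{E}} ] <= exp(a \lambda^2)   for all \lambda > 0.
\begin{aligned}
P\big(\{Z > t/2\} \cap \mathcal{E}\big)
  &\le E\big[\exp\{\lambda (Z - t/2)\}\, 1\{\mathcal{E}\}\big]
   \le \exp\big(-\lambda t/2 + a\lambda^2\big), \qquad \lambda > 0, \\
\frac{d}{d\lambda}\big(-\lambda t/2 + a\lambda^2\big) = 0
  &\iff \lambda^{*} = \frac{t}{4a}, \\
P\big(\{Z > t/2\} \cap \mathcal{E}\big)
  &\le \exp\Big(-\frac{t^2}{8a} + \frac{t^2}{16a}\Big)
   = \exp\Big(-\frac{t^2}{16a}\Big).
\end{aligned}
```

A union bound over the $6^d$ points of the net then contributes the additive $d\log(6)$ term in the exponent.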


Lemma B.9. Following the same notation as in the proof of Theorem 4.13, 

(~<4 t2 „i2 


P({||B ||2 > ti} n Al) < 2exp (ilog(6) — 


128</.C/2Cp„ax(5l V,52)2; ■ 

Proof. By Lemma A.2, for any $\lambda \in \mathbb{R}$ and $v$ such that $\|v\|_2 = 1$, we have

nk 

E [exp{X{D[^Kf{X^^^^e^^^/nk)) \ E (exp((AXy| 


2=1 


rik 


<expUc/A2^(A|^0V 


0U2/^2 


2=1 


and 


E (exp(A(Zl2v)'^(A'^e/n)) | X) = E (exp((AXj/n)'^(T)2v)ei) | X) 


2 = 1 


< 


exp [ (/C/A^^A^/n^ 


2=1 
where we write $A_i^{(j)}$ and $A_i$ in place of $(X_i^{(j)})^TD_1^{(j)}v$ and $X_i^TD_2v$ respectively. Next we give an upper bound on $\sum_{i=1}^{n_k}(A_i^{(j)})^2$ and $\sum_{i=1}^{n}A_i^2$. Note that


i=l 

= - S-i)v 

= nv^S-^(S - sA){SA)-is^i\sA)-^{j: - sA)^-\. 


Similarly, 

n 

= nv'^S-^S - S)S-^SxS-\j: - 5)S-V. 

i=l 

On SA and S, we have respectively 

|^(^fa ))2 ^ (i, V Sff and ^ V «|)2. 

Then it follows that 


E (ex.p{X{D[^K)'^{xA'^eA/nk))t{SA}\ < exp V 

and 

E (exp(A(i:>2v)^(X'^e/n))l{£:}) < exp (*^2 V . 

Now we follow exactly the same steps as in the OLS part. Denote Dj^Aj by So- An application of 
the Chernoff bound technique and the maximal inequality leads us to the following inequality. 

r > </2} n£„) < exp(dl„g(6) - 

and 

/ T 2 a2 \ 

F({UXD, fs/nh > t/2]nE) < exp (<ilog(6) - ' 

We have thus derived an upper bound for ||S ||2 that holds with high probability. Specifically, 


P({||S||2 >fi}n^) <P 

+ p 





||(AD 2 )'^e/n ||2 > ^ O T ) < 2exp 


frflog(6) 


^4 t2 y. 


2 

1 


128</.D2Cp„ax(5l V (52)2 


□ 
Lemma B.10. Under Condition 3.6, for $r < L_{\min}/(8MC_{\max}U_3\sqrt{d})$ and sufficiently large $n$ and $d$,
we have


P(||/3 — P *\\2 > t) < exp ( dlog 6 — 


„/^2 t 2 _2 
/6v^min-^min' 


+ 2 exp(—cn). 


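Proof. The notation is that introduced in the proof of Theorem 4.13. We further define $\Sigma(\beta) := E(b''(X^T\beta)XX^T)$ as well as the event $\mathcal{H} := \{\ell_n(\beta^*) > \max_{\beta \in \partial B_r}\ell_n(\beta)\}$, where $B_r = \{\beta : \|\beta - \beta^*\|_2 \le r\}$. Note that as long as the event $\mathcal{H}$ holds, the MLE falls in $B_r$; the proof strategy therefore involves showing that $P(\mathcal{H})$ approaches 1 at a certain rate. By the Taylor expansion,

\[
\begin{aligned}
\ell_n(\beta) - \ell_n(\beta^*)
&= (\beta - \beta^*)^Tv - \tfrac{1}{2}(\beta - \beta^*)^TS(\tilde{\beta})(\beta - \beta^*) \\
&= (\beta - \beta^*)^Tv - \tfrac{1}{2}(\beta - \beta^*)^TS(\beta^*)(\beta - \beta^*) - \tfrac{1}{2}(\beta - \beta^*)^T\big(S(\tilde{\beta}) - S(\beta^*)\big)(\beta - \beta^*) \\
&= A_1 + A_2,
\end{aligned}
\]

where $S(\beta) = (1/n)X^TD(X\beta)X$, $\tilde{\beta}$ is some vector between $\beta$ and $\beta^*$, $v = (1/n)X^T(Y - \mu(X\beta^*))$, $A_1 = (\beta - \beta^*)^Tv - (1/2)(\beta - \beta^*)^TS(\beta^*)(\beta - \beta^*)$ and $A_2 = -(1/2)(\beta - \beta^*)^T(S(\tilde{\beta}) - S(\beta^*))(\beta - \beta^*)$.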

Define the event $\mathcal{E} := \{\lambda_{\min}[S(\beta^*)] \ge L_{\min}/2\}$, where $L_{\min}$ is the same constant as in Condition 3.6. Note that by Condition 3.6 (ii), $\sqrt{b''(X_i^T\beta^*)}\,X_i$ is a sub-gaussian random vector. Then by Condition 3.6 (iii) and Lemma B.4, for sufficiently large $n$ and $d$ we have $P(\mathcal{E}^c) \le \exp(-cn)$.
Therefore on the event $\mathcal{E}$,

\[
A_1 \le r\Big(\|v\|_2 - \frac{L_{\min}}{4}r\Big).
\]

We next show that, under an appropriate choice of $r$, $|A_2| \le L_{\min}r^2/8$ with high probability.
We first consider Condition 3.6 (ii). Define $\mathcal{F} := \{\|X^TX/n\|_2 \le 2C_{\max}\}$. By Lemma B.3, we have
$P(\mathcal{F}^c) \le \exp(-cn)$. By Lemma A.5, on the event $\mathcal{F}$, we have

\[
A_2 \le \max_{1 \le i \le n}\big|b''(X_i^T\tilde{\beta}) - b''(X_i^T\beta^*)\big| \cdot C_{\max}r^2
\le MU_3\sqrt{d}\,\|\tilde{\beta} - \beta^*\|_2 \cdot C_{\max}r^2
\le \frac{L_{\min}r^2}{8},
\]

where the last inequality holds if we choose $r < L_{\min}/(8MC_{\max}U_3\sqrt{d})$. Now we obtain the following
probabilistic upper bound on $P(\mathcal{H}^c)$, which we later prove to be negligible.


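\[
\begin{aligned}
P(\mathcal{H}^c) &\le P(\mathcal{H}^c \cap \mathcal{E} \cap \mathcal{F}) + P(\mathcal{E}^c) + P(\mathcal{F}^c) \\
&\le P\Big(\Big\{\|v\|_2 \ge \frac{L_{\min}r}{8}\Big\} \cap \mathcal{E} \cap \mathcal{F}\Big) + P(\mathcal{E}^c) + P(\mathcal{F}^c).
\end{aligned}
\tag{B.3}
\]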


Since each component of $v$ is a weighted average of i.i.d. random variables, the effect of concentration tends to make $\|v\|_2$ very small with high probability, which motivates studying the moment generating function and applying the Chernoff bound technique. By Lemma A.2, for any constant vector $u \in \mathbb{R}^d$ with $\|u\|_2 = 1$, letting $a_i = u^TX_i$, we have for any $t \in \mathbb{R}$,


i=l 


E(exp(t(u,v)) |X) = ]^E (exp ( —{Yi - n{Xj(3)) 


< exp 


= exp 


4>U2t^ 


n 


E' 


I 2n2 

\ / 

\ 2n n J 



It follows that 

E exp(t(u, v) \{£ n -F}) < exp 
By the Chernoff bound technique, we obtain 

E({(u, v) > e} n F n -F) < exp 


(pCrai,xU2t‘^ 

2n 


ne 


8C'maxb^2<^ 


Consider a 1/2-net of $\mathbb{R}^d$, denoted by $\mathcal{N}(1/2)$. Since

\[
\|v\|_2 = \max_{\|u\|_2 = 1}\langle u, v\rangle \le 2\max_{u \in \mathcal{N}(1/2)}\langle u, v\rangle,
\]


it follows that 


iP({l|v|| 2 >%^}nFn^)< 


< 6“exp - 


/ V Finin'^ 

max (u, v) > - 

ueAr{l/2)^ ’ ' 16 


nLi-Y 


n£nx 


= exp d log 6 — 


2lVCmaxF2 

„(^2 r 2 2 

* ^rvii nrvi 1 ' 


mm mm 


Finally combining the result above with Equation (B.3) delivers the conclusion. 


□ 
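
The net step in the proof of Lemma B.10, $\|v\|_2 \le 2\max_{u \in \mathcal{N}(1/2)}\langle u, v\rangle$, can be checked directly in a small case. The sketch below assumes $d = 2$ and builds a 1/2-net of the unit circle from seven equally spaced points. This construction is an illustrative choice, and any 1/2-net of the unit sphere would serve. It also confirms that this net respects the cardinality bound $6^d$.

```python
import math
import random

def half_net_circle(m=7):
    """m equally spaced unit vectors; for m >= 7 they form a 1/2-net of the unit circle."""
    return [(math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m)) for j in range(m)]

def covering_radius(net, grid=3600):
    """Largest distance from a point on a fine grid of the circle to its nearest net point."""
    worst = 0.0
    for g in range(grid):
        x, y = math.cos(2 * math.pi * g / grid), math.sin(2 * math.pi * g / grid)
        worst = max(worst, min(math.hypot(x - u1, y - u2) for u1, u2 in net))
    return worst

def net_max(net, v):
    """Maximum over net points u of the inner product <u, v>."""
    return max(u1 * v[0] + u2 * v[1] for u1, u2 in net)

net = half_net_circle()
radius = covering_radius(net)
rng = random.Random(2)
vectors = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
# Ratio ||v||_2 / max_{u in net} <u, v>; the net argument guarantees it is at most 2.
ratios = [math.hypot(*v) / net_max(net, v) for v in vectors]
print("net size:", len(net), "covering radius:", round(radius, 4), "max ratio:", round(max(ratios), 4))
```

Replacing the supremum over the whole sphere by a maximum over finitely many directions is what allows the union bound, at the cost of the factor 2 and the $6^d$ multiplicity.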


Remark B.11. A simple calculation shows that when $d = o(\sqrt{n})$, $\|\hat{\beta} - \beta^*\|_2 = O_P(\sqrt{d/n})$. When
$d$ is a fixed constant, $\|\hat{\beta} - \beta^*\|_2 = O_P(\sqrt{1/n})$.